
Circuit and Physical Design Implementation of the Microprocessor Chip for the zEnterprise System

James Warnock, Senior Member, IEEE, Yiu-Hing Chan, Sean Carey, Huajun Wen, Member, IEEE, Pat Meaney, Guenter Gerwig, Member, IEEE, Howard H. Smith, Yuen Chan, Member, IEEE, John Davis, Paul Bunce, Antonio Pelella, Dan Rodko, Pradip Patel, Thomas Strach, Doug Malone, Frank Malgioglio, José Neves, Associate Member, IEEE, David L. Rude, and William Huott

Abstract—This paper describes the circuit and physical design features of the z196 processor chip, implemented in a 45 nm SOI technology. The chip contains 4 super-scalar, out-of-order processor cores, running at 5.2 GHz, on a die with an area of 512 mm² containing an estimated 1.4 billion transistors. The core and chip design methodology and specific design features are presented, focusing on techniques used to enable high-frequency operation. In addition, chip power, IR drop, and supply noise, which were key design focus areas, are discussed. The chip's ground-breaking RAS features, engineered for maximum reliability and system stability, are also described.

Index Terms—Cache set predict, chip integration, chip IR drop, chip supply noise, circuit design methodology, clock distribution, clock grid, CMOS digital integrated circuits, design for reliability, design for test, digital circuits, high-frequency CMOS design, microprocessor test, microprocessors, power efficiency, RAIM, RAS, reliability, SRAM, system z, VLSI design, zEnterprise, zEnterprise 196, z196, 45 nm SOI.

I. INTRODUCTION

The z196 processor chip at the heart of the zEnterprise system is introduced and described, focusing on specific design techniques and circuits that allowed the chip to reach an operating frequency of 5.2 GHz. After an overview of the core and chip design, details of the core circuit design methodology are discussed. Several special circuit topics are covered in detail, including aspects of the core array design and the D-cache set-predict circuitry. At the chip level, RAS features are described, along with aspects of the global design important to high-speed operation, including wiring and buffering solutions, power and IR-drop analysis, and supply noise issues.

Manuscript received April 13, 2011; revised June 19, 2011; accepted August 22, 2011. Date of publication October 31, 2011; date of current version December 23, 2011. This paper was approved by Guest Editor Tanay Karnik.

J. Warnock is with the IBM Systems and Technology Group, Yorktown Heights, NY 10598 USA (e-mail: [email protected]).

Y.-H. Chan, S. Carey, P. Meaney, H. Smith, Y. Chan, J. Davis, P. Bunce, A. Pelella, D. Rodko, P. Patel, D. Malone, F. Malgioglio, J. Neves, D. Rude, and W. Huott are with the IBM Systems and Technology Group, Poughkeepsie, NY 12601 USA.

H. Wen is with the IBM Systems and Technology Group, Austin, TX 78758 USA.

G. Gerwig and T. Strach are with IBM Research & Development GmbH, Boeblingen 71032, Germany.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2011.2169308

Fig. 1. Chip die photo.

II. CHIP AND CORE OVERVIEW

A. The z196 Chip

The z196 chip [1] was designed in IBM's high-performance 45 nm CMOS technology [2], with parameters and devices similar to those described in earlier work [3], but with several features added to support the high-frequency design point. Two thick, upper-level wiring planes were added (at 4X the minimum wire pitch of the lowest levels) to improve latency on wire buses in the processor core, and to provide high-bandwidth, low-latency access to the L3 DRAM cache. Thus a total of 13 levels of metal interconnect were used in the design. In addition, a fourth device threshold voltage (VT) option was provided, "low VT", or LVT, for closure of the most critical timing paths. LVT gates were about 8% faster than their regular-VT (RVT) counterparts, but came with a substantial leakage penalty (about 8X more leakage than RVT).

A die photo of the chip is shown in Fig. 1. The chip floorplan contains four dense CPU cores, each with a dedicated 1.5 MB L2 SRAM cache. There are also two co-processors on the chip, used for data compression and encryption. The middle column of the chip is occupied by a 24 MB DRAM L3 cache that is shared by all 4 cores. Finally, there is a main memory control unit (MCU) on the left side of the chip, and an I/O bus controller (GX) on the right. Much of the timing and routing complexity at the chip level comes from the fact that not only is the L3 cache shared by all processor cores, but it also acts as a 'traffic cop' for the whole chip. The L3 is the control center for data coming in and out of the cores, the MCU, the I/O controller, and the off-chip interfaces to the L4 cache chips which connect to other nodes of the system. All of these data connections funnel in and out of the L3, which controls where the data is sent and in which cycle. This led to many timing and wireability challenges, which were part of the motivation for the addition of extra high-performance metal layers to the technology, as described above. Challenges also arose during the implementation of the multiple voltage domains and clock grids on the chip, described in later sections. Timing penalties were incurred at every voltage-domain crossing, and extra care was needed while routing power rails and placing buffers around these boundaries. As for the various clock regions, each region was carefully planned with custom clock wiring solutions tailored to meet the stringent project clock-skew specifications needed to enable high-frequency operation.



Fig. 2. Processor core logic overview.

In all, the chip occupies an area of 512 mm², with 1.4 billion transistors and over 6 km of wire interconnect. It uses 9227 C4 connections (1134 of which are signal I/Os, with the rest for power and ground). Over one million repeaters were used in order to maintain signal slew rates.

B. The z196 Core: Logic Overview

The z196 microprocessor is an aggressive superscalar, out-of-order execution design (Fig. 2) [4], [5]. It fetches, decodes, and dispatches up to three system-z instructions per cycle. During decode, complex instructions are cracked into 2 or more micro-operations, up to three of which are dispatched per cycle and stored in a 40-entry issue queue. The out-of-order issue window is thus 40 instructions or micro-operations. During dispatch, the source architected register(s) specified in the instruction text are mapped to physical registers, and the target architected register(s) are assigned a new physical register.

Up to five micro-operations are issued and executed per cycle. The oldest micro-operation whose data, condition code, and/or control dependencies have been resolved is issued to one of five execution pipelines: two fixed-point (integer) pipelines, two load-store pipelines, or a floating-point unit with binary and decimal pipelines. A global completion table holds 24 groups of up to 3 micro-operations each. This table is written in program order and completes groups in order; a group is completed when all of its micro-operations have finished execution without error and all prior groups have completed.
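To make the issue mechanics concrete, the sketch below is a minimal Python model of oldest-ready-first selection from an issue queue across five pipelines. Everything in it (the readiness test, the pipeline names, the one-operation-per-pipe restriction) is an illustrative assumption, not a description of the actual z196 issue hardware.

    from dataclasses import dataclass, field

    @dataclass
    class MicroOp:
        seq: int                                   # program order; lower = older
        pipe: str                                  # 'FXU0','FXU1','LSU0','LSU1','FPU' (assumed names)
        sources: set = field(default_factory=set)  # physical registers still pending

    def issue_cycle(queue, ready_regs, max_issue=5):
        """Issue up to max_issue ready micro-ops, oldest first, one per pipe."""
        issued, busy_pipes = [], set()
        for uop in sorted(queue, key=lambda u: u.seq):       # oldest first
            if len(issued) == max_issue:
                break
            if uop.pipe not in busy_pipes and uop.sources <= ready_regs:
                issued.append(uop)                           # all deps resolved: issue
                busy_pipes.add(uop.pipe)
        for uop in issued:
            queue.remove(uop)
        return issued

    q = [MicroOp(3, 'FXU0', {7}), MicroOp(1, 'FXU0'), MicroOp(2, 'LSU0')]
    print([u.seq for u in issue_cycle(q, ready_regs={5})])   # -> [1, 2]; op 3 waits on reg 7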

Asynchronous dynamic branch prediction logic generally executes ahead of instruction fetching, providing prefetching of instruction text into the first-level I-cache. Branch prediction also steers instruction fetch down the predicted taken path. The branch prediction logic includes a branch target buffer and a changing target buffer to hold the predicted branch target addresses. Branch directions are predicted by two-bit saturating counters within the branch target buffer and within a pattern history table indexed by global branch history. The z196 core contains three private caches: a first-level 64 KB I-cache, a first-level 128 KB D-cache, and a combined second-level 1.5 MB cache. The level-3 cache is shared by all four cores on the chip.
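The two-bit saturating-counter scheme is standard enough to show in a few lines. The Python sketch below implements a pattern history table indexed by an XOR of the fetch address with global history (a gshare-style hash); the table size and the hash function are assumptions made for illustration, since the paper does not give them.

    # Illustrative two-bit saturating-counter prediction with a pattern
    # history table indexed by global branch history. Sizes and hashing
    # are invented; only the counter behavior matches the text.
    PHT_BITS = 12
    pht = [1] * (1 << PHT_BITS)   # 2-bit counters, init "weakly not taken"
    ghist = 0                     # global branch history register

    def predict(pc):
        idx = (pc ^ ghist) & ((1 << PHT_BITS) - 1)
        return pht[idx] >= 2      # counter of 2 or 3 means predict taken

    def update(pc, taken):
        global ghist
        idx = (pc ^ ghist) & ((1 << PHT_BITS) - 1)
        if taken:
            pht[idx] = min(3, pht[idx] + 1)   # saturate at strongly taken
        else:
            pht[idx] = max(0, pht[idx] - 1)   # saturate at strongly not taken
        ghist = ((ghist << 1) | int(taken)) & ((1 << PHT_BITS) - 1)

The saturation is what gives the predictor hysteresis: a single anomalous outcome in a strongly biased branch does not flip the next prediction.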

Two page sizes (4 KB and 1 MB) are supported by both the first-level data and instruction translation look-aside buffers (TLBs). The core also contains a combined second-level TLB. Sixteen physical store-queue entries are implemented in the load-store pipeline to support forwarding of recent store data to younger load instructions. These load-hit-store events are common, since the system-z architecture includes many storage-to-storage instructions, and system-z code is non-RISC-like, holding intermediate execution results in storage instead of registers.
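A toy model of the load-hit-store forwarding just described, with invented simplifications (word-granular addresses, a dictionary standing in for the cache): a load scans the older stores in the 16-entry queue, youngest first, and forwards matching data.

    from collections import deque

    store_queue = deque(maxlen=16)   # entries: (seq, address, data)

    def execute_store(seq, addr, data):
        store_queue.append((seq, addr, data))

    def execute_load(seq, addr, memory):
        # Scan youngest-to-oldest for the most recent store older than us.
        for s_seq, s_addr, s_data in reversed(store_queue):
            if s_seq < seq and s_addr == addr:
                return s_data            # load-hit-store: forward store data
        return memory.get(addr, 0)       # otherwise read from the cache

    memory = {0x100: 7}
    execute_store(1, 0x100, 42)
    print(execute_load(2, 0x100, memory))   # -> 42, forwarded from the queue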


III. CORE CIRCUIT DESIGN

A. Core Circuit Design Methodology

A hierarchical design style was adopted for z196, where the core was built from a collection of functional units which existed as hard physical entities, which in turn were built from collections of "macros". A macro, typically containing 10 K–50 K transistors, was the lowest level of abstraction, and was analyzed and checked flat at the transistor level. A variety of manual and automatic optimization techniques, combined with a set of recommended design practices and specific design rules, were applied at the macro and higher routing levels to achieve the high-frequency target, as described in more detail in the paragraphs below.

Core custom dataflow macros were designed with 2-parameter (n-width and p-width) static CMOS gates which could be tuned with an in-house transistor-level tuning tool, referred to henceforth as simply the Tuner [6], to obtain optimal gate sizes during pre- and post-layout design phases. Tapered gates with a taper ratio of 2.0 (i.e., the ratio of the non-critical device width(s) at the bottom of a series transistor stack to the critical device width at the top of the stack) were designed for a set of the most commonly used gates, providing an average delay speedup of 2–8% over similar non-tapered gates. These were used in the carry-look-ahead circuits in the 64-bit operand adder in the Fixed-Point Execution Unit (FXU) single-cycle execution loop, and in the address-generation (AGEN) adder in the D-cache 4-cycle loop [1].

During the pre-layout phase, the macro circuit schematic was drawn to include placement information for all gates, along with wire-code specifications for timing-critical nets. The placement information was used by a placement routine [7] to create a hierarchical, placed layout. Using the wire-code specifications along with the layout information, parasitic RC networks were estimated [7] based on Steiner analysis. The resulting annotated circuit netlist was used for pre-layout gate sizing and beta-ratio optimization with the Tuner to improve slack. The automated flow from schematic through to fully-tuned, placement-based annotated netlists allowed designers to iterate rapidly on their designs, restructuring as necessary, as timing requirements stabilized over time. In addition, rapid iteration on the macro floor plan allowed the designers to focus on reduction of critical internal/external RC delays and overall parasitic load capacitance. Wire-width tuning and bottoms-up macro pin assignment, based on receiving and/or driving circuit locations, were also used to further reduce overall wire RC delay.
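For readers unfamiliar with how a Steiner-estimated RC network turns into a timing number, the sketch below computes first-order Elmore delays on an RC tree: each resistance on the path from the driver charges all of the capacitance downstream of it. The topology and values are invented, and the production flow [7] is of course far more sophisticated.

    def elmore_delays(tree, root):
        """tree: node -> (R_from_parent_ohms, C_node_farads, [children]).
        Returns {node: Elmore delay in seconds from the root}."""
        def subtree_cap(n):
            r, c, kids = tree[n]
            return c + sum(subtree_cap(k) for k in kids)

        delays = {}
        def walk(n, acc):
            r, c, kids = tree[n]
            acc += r * subtree_cap(n)   # this R charges everything below n
            delays[n] = acc
            for k in kids:
                walk(k, acc)
        walk(root, 0.0)
        return delays

    # A driver at 'a' feeding a T-shaped net: a -> b -> {c, d}.
    net = {
        'a': (50.0, 5e-15, ['b']),
        'b': (100.0, 10e-15, ['c', 'd']),
        'c': (80.0, 20e-15, []),
        'd': (120.0, 15e-15, []),
    }
    print(elmore_delays(net, 'a'))   # sink 'c' sees ~8.6 ps in this toy net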

The dataflow macro routing was typically done by first pre-routing the clock and timing-critical nets and then auto-routing the rest of the nets. A small number of critical macros were routed manually to ensure wireability and the highest post-route timing quality. Post-layout tuning was done with the Tuner using a layout-aware option [5] to restrict gate-size changes, thereby avoiding any changes to the macro floor plan. Threshold-voltage optimization could also be done at a very late stage in the design, since all VT options were completely footprint-compatible. Gates could be moved from regular VT (RVT) to high VT (HVT), or to "super-high VT" (SVT), for power reduction.

TABLE I. RELATIVE WIRE PITCH AND RC DELAY PARAMETERS

LVT gates were inserted into critical paths during this post-layout phase for the final push to reach the design frequency target. Overall, less than 1% of the total logic device width was swapped to LVT, which ended up accounting for about 0.5% of the total logic DC leakage. Automated design tuning near the end of design closure resulted in about a 10% reduction of the overall circuit physical design power.
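The post-layout VT swap lends itself to a simple greedy sketch: spend LVT leakage only where slack is negative, and harvest positive slack with HVT or SVT. The slack thresholds below are invented for illustration; the text gives only the end result (under 1% of device width on LVT, about 0.5% of logic DC leakage).

    # Greedy sketch of footprint-compatible VT reassignment. Thresholds
    # are assumptions; the real flow was driven by the Tuner and full
    # transistor-level timing.
    def reassign_vt(gates):
        """gates: list of dicts with 'name', 'slack_ps', 'vt'. Mutates in place."""
        for g in sorted(gates, key=lambda g: g['slack_ps']):   # worst first
            if g['slack_ps'] < 0 and g['vt'] == 'RVT':
                g['vt'] = 'LVT'            # leakage spent only on critical paths
            elif g['slack_ps'] > 60 and g['vt'] == 'HVT':
                g['vt'] = 'SVT'            # deep slack: lowest-leakage option
            elif g['slack_ps'] > 30 and g['vt'] == 'RVT':
                g['vt'] = 'HVT'            # moderate slack: harvest for power
        return gates

Because all VT flavors are footprint-compatible, this kind of pass touches neither the floorplan nor the wiring, which is what makes it safe so late in the schedule.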

Control logic was partitioned into synthesizable random logic macros (RLMs) according to logic function and macro size targets. After construction of each RLM, the Tuner was used extensively for post-layout, layout-aware tuning. After tuning, the resultant device widths could be binned back to the fixed library cells, for ease of making future logic updates, or the design could be converted to the parameterized cells (the final timing was transistor-based in either case). In this way, transistor sizes and VTs could be re-optimized as the surrounding timing environment changed, without impacting the RLM floorplan or any of the wire interconnects. Finally, LVT gates were swapped into critical control paths after timing stabilization, as the last step to improve frequency.

A strategy to optimize wire interconnects is important for any high-frequency microprocessor design. The 13 layers of metal in this technology were engineered to provide an optimal tradeoff between density and performance, as shown in Table I. Inter-macro and cycle-limiting global routes for timing-critical nets used the top metal layers (three 4X and two 10X planes). These 4X and 10X wiring planes provided up to a 4.4X wire RC delay reduction compared to the 2X wiring planes, which were used for higher-density, less-critical inter-macro connections. Since the two 10X layers were used heavily for clock, power, and I/O routing, their usage for signal routing was reserved for frequency-limiting long nets, such as the connections between the AGEN adder output and the eight instances of D-cache array address inputs in the 4-cycle loop. For these critical high-level routes, manual design work ensured a direct connection from the driver output pin to the upper-level wire, improving the signal slew and minimizing the wire RC delay.
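As a reminder of why the wide-pitch planes matter, the standard distributed-RC estimate (a textbook relation, not one quoted in the paper) for an unbuffered wire of length $L$ with resistance $r$ and capacitance $c$ per unit length is

$$ t_{\mathrm{wire}} \approx 0.38\, r\, c\, L^{2}, $$

so a plane with a lower $rc$ product attacks the quadratic term directly; the design team quantified this as up to the 4.4X delay advantage of Table I, while the million-plus repeaters mentioned earlier serve to break the $L^{2}$ dependence on the longest nets.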

B. Core Array Design

Although the bulk of the z196 logic design was accomplished with regular CMOS static circuitry as described above, certain parts of the design, including register files and logic associated with SRAM array access, were implemented with full-custom design techniques, including dynamic circuits. The L1 cache arrays are a good example of this style, posing some of the most demanding circuit performance challenges. They provide a single-cycle read/write in the core 5.2 GHz frequency domain. Achieving these frequency targets required optimization of array organization, macro floor planning, interconnect RC minimization, and customized circuit and physical design.


Fig. 3. L1 D-cache critical 4-cycle access loop.

Carefully tuned domino/dynamic circuits with RVT and HVT devices were used extensively in array peripheral critical paths to reduce access latency.

In spite of the priority focus on high-speed operation, significant effort had to be put into power efficiency as well, in order to meet the strict chip power target. Fine-grained clock gating, selective higher-VT usage, circuit detuning wherever timing slacks allowed, and elimination of the half-selected memory cell state (by getting rid of bit-decode addressing schemes wherever possible) were all used to lower array power dissipation. With these design techniques, the z196 arrays were able to lower their power consumption by up to 20–30% relative to the previous 65 nm design, even while running almost 20% faster.

A relatively large but high-performance 6T cell (0.462 μm²) was used for the core L1 arrays, along with a hierarchical bitline structure having 16 cells per local bit line, with a single-phase domino read head [8] optimized for power and performance. A key feature of the z196 arrays is the extensive programmable timing control for all internal and external critical paths. This timing flexibility enabled hardware-based chip frequency tuning, described in a later section, to include array-limited critical paths in order to attain maximum system performance, as well as ensuring robust hardware functionality and design margins.

Access to the z processor L1 cache is by way of a 4-cycle loop. Fig. 3 shows the L1 D-cache lookup path functional block diagram. The late-select 8:1 set-hit signals to the D-cache are generated by the Set-Predict array (SETP), described in the next section. Both arrays are accessed in parallel in cycle A1, with outputs spanning across the cycle boundary into A2, and subsequently captured by the formatter to proceed to the A3 cycle. Address generation (AGEN) in cycle A0 is a critical cycle-limiting path feeding the arrays. Highly optimized static circuits, selective LVT usage, plus cycle stealing into A1 were used to achieve the core timing targets.

The 128 KB D-cache was partitioned into 8 macros with a 2-stage access pipeline supporting 2 independent reads or one write per cycle. The macro logical organization differs for read and for write.

Fig. 4. L1 D-cache macro. (a) Logical structure. (b) Macro physical view.

Fig. 5. L1 D-cache wordline access path circuits.

Fig. 4 shows the macro subarray partition for reading and writing. Each macro contains eight subarrays of 256 wordlines by 72 bit columns. For a read operation, either the upper or the lower 4 subarrays are active. Each read port is 512 entries deep, 8-way set-associative, and provides 4 bytes of data (1 byte from each subarray) after the late-select mux. For a write operation, all 8 subarrays are active, with 4 bytes of data written into the upper, and another 4 bytes written into the lower, subarrays. Fig. 5 is a simplified schematic cross-section of the wordline access circuitry.
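The late-select organization is easy to see in a behavioral sketch: all eight ways are read in parallel, and the one-hot set-hit vector from the SETP macro picks the data at the last possible moment. The Python below models only the selection logic; the sizes follow the text (512 entries per read port, 8 ways, 4 bytes), and everything else is illustrative.

    WAYS, SETS = 8, 512

    def dcache_read(arrays, set_index, set_hit):
        """arrays: [way][set] -> 4-byte data; set_hit: one-hot list of 8
        bits from the SETP macro. Returns the selected data, or None."""
        way_data = [arrays[w][set_index] for w in range(WAYS)]  # parallel read
        for w, hit in enumerate(set_hit):
            if hit:                      # late-select 8:1 mux
                return way_data[w]
        return None                      # no predicted way: treat as miss

The point of the structure is that the slow array access and the way-selection computation proceed in parallel, with the mux select arriving just in time at the end.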


Fig. 6. (a) SETP macro physical image. (b) SETP array dual-phase R/W column circuitry.

Domino circuits are used throughout all timing-critical paths to minimize latency. A dedicated SRAM power supply (VCS, typically 100–150 mV higher than the normal logic power supply Vdd) is used in the majority of the peripheral circuits for improved performance and low-voltage operation.

C. SETP Macro

The Set Predict (SETP) macro provides the late-select control to the D-cache macro described above. The SETP uses a 14-bit dynamic hit-logic scheme with an embedded 8 Kb SRAM [9]. The hit logic uses a "search-for-a-hit" scheme (i.e., XORs followed by AND functions, precharged to a miss) to provide optimal performance and timing with minimal power dissipation. The embedded SRAM uses the same 6T SRAM cell as the D-cache, in a "domino" hierarchical dual read bit line approach [10]. The SETP macro organization is shown in Fig. 6(a). The right and left quadrants share 32 wordline drivers located in the center. The upper 4 sets are mirrored around a central region to create the bottom 4 sets, with the first-level decode, clocks, input latches, and write and timing control logic in between. In order to deliver maximum read performance, the 32-cell bitline stack was divided into four segments of eight cells per local bit line (LBL), with dual-phase read heads (Fig. 6(b)). The global bit select circuit was placed in the center, with two 8-cell subarrays above and two 8-cell subarrays below, to reduce RC effects on the global read bit lines. The gbitsel has Fast-Read-Before-Write (FRBW) blocking circuits to prevent false reads during write operations [10], and also provides dual-phase dynamic outputs to drive directly into the 14-bit dynamic hit-logic comparator. Both the SRAM and hit-logic functions are tested together by an Array Built-In Self-Test (ABIST) engine, resulting in comprehensive at-speed test coverage to guarantee circuit functionality and timing margins.
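Functionally, the "search-for-a-hit" compare reduces to an XOR of each stored tag against the lookup tag followed by an AND across the 14 bits, with the dynamic node precharged to the miss state. The Python sketch below captures that logic function only (the domino circuit implementation is the whole point of the design and is not modeled here); the tag values are invented.

    TAG_BITS = 14

    def setp_hits(stored_tags, lookup_tag):
        """Return the one-hot hit vector across the 8 sets."""
        hits = []
        for tag in stored_tags:                 # one stored tag per set
            diff = tag ^ lookup_tag             # XOR: per-bit mismatch flags
            hits.append(1 if diff == 0 else 0)  # AND of match bits; default miss
        return hits

    tags = [0x00aa, 0x1a2b, 0x3fff, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005]
    print(setp_hits(tags, 0x1a2b))   # -> [0, 1, 0, 0, 0, 0, 0, 0]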

IV. CHIP AND GLOBAL DESIGN CONSIDERATIONS

A. Unit and Chip Physical Integration Methodology

As mentioned earlier, the design structure was partitioned hierarchically, allowing concurrent design of each unit, where a unit was typically made up of a mix of SRAM arrays, register files, custom dataflow macros, and RLMs. The integration team needed to be able to make steady progress towards the chip power and frequency targets even as the logical structure of the design was evolving. A suite of integration design tools met these needs, tagging nets with special wire codes, adding buffers, managing wire usage and blockages across the hierarchy, and performing timing analyses either incrementally or with a full refresh of the input data.

Wiring codes were assigned automatically at the start, with a length-based algorithm used to initially upgrade longer nets. Later, timing feedback (and engineering judgment) was used to further optimize the wiring solution. Finally, many units, because of the high density of wires, required manual planning and engineering for a number of buses. A limited amount of routing was done on the highest (10X) planes, where inductive effects were a potential concern. Although the timing model did not incorporate a full analysis of inductive effects, buses on this level were analyzed in detail with an RLC model to make sure that the simpler timing model was accurate enough for these wires.
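A minimal sketch of the length-based assignment plus timing-driven upgrade described above; the length thresholds are invented placeholders, since the real rules were project-specific.

    # Toy wire-code assignment: longer nets start on faster planes, and
    # timing feedback can later promote a failing net one plane class.
    def initial_wire_code(length_um):
        if length_um > 2000:
            return '10X'        # reserved for frequency-limiting long nets
        if length_um > 500:
            return '4X'
        if length_um > 100:
            return '2X'
        return '1X'

    def upgrade_for_timing(net):
        """net: dict with 'code' and 'slack_ps'. Promote if failing."""
        order = ['1X', '2X', '4X', '10X']
        i = order.index(net['code'])
        if net['slack_ps'] < 0 and i < len(order) - 1:
            net['code'] = order[i + 1]
        return net

    print(upgrade_for_timing({'code': initial_wire_code(800), 'slack_ps': -12}))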

The z196 processor used a new integration design platform [11] that combined routing, timing analysis, parasitic extraction, and timing and power optimization, enabling integrators to automatically optimize from scratch or perform incremental, localized fixes with immediate feedback from the analysis tools.


This methodology significantly reduced manual integration work. Unit-level manual work and bookkeeping were required for up to 50% of nets in the previous design, but dropped to 2% of nets with the new methodology, with some units requiring no manual engineering work despite the high design frequency. The automatic timing optimization flow was able to implement more than 99% of all slew fixes while simultaneously reducing buffer count by 15% compared to previous methods. Finally, new power-down algorithms were used towards the end of the design process, saving up to 30% of the total power consumed by global wires and buffers.

B. Global Clock Grid Design

The z196 clock distribution network includes four core clock grids (one for each core), all running at 5.2 GHz, with a nominal design skew of under 5 ps and a 10–90% slew of about 40 ps. A separate nest grid runs at half the core frequency, and there are two asynchronous grids for the MCU and the GX I/O interface. Power reduction was the primary motivation for splitting the main chip clock grid into separate nest and core grids, compared to the previous-generation design, which used a unified chip-level grid with local clock division as needed. The split-grid design resulted in about a 30% clock-grid power reduction compared to a unified grid design, as a result of the reduction in the clock-grid frequency over the nest regions. The MCU and GX have their own phase-locked loops (PLLs), running at 2.4 and 2.3 GHz respectively, with the GX having the additional feature of being configurable to run off the nest PLL, depending on which external facility is being connected.

The most significant challenge for the clock distribution team was latency matching between the core and nest clocks, to support synchronous communication across the core/L2 cache boundaries. To this end, the nest mesh was divided into multiple sub-meshes, each of which mimicked the core mesh topology. The repowering from each sub-mesh input to mesh driver was then matched as closely as possible to the core mesh repowering circuitry. The distributions from PLL to mesh/sub-mesh inputs were also tuned and balanced together. These distributions also included programmable circuitry to allow tuning of the core-to-nest mesh skew and the core or nest clock duty cycle.

C. Design RAS Features

The focus on Reliability, Availability, and Serviceability (RAS) started with the basic design building blocks and continued throughout the design process. For large multi-core server processors, single-event upsets can be a significant reliability issue, especially as technology scaling pushes designs to ever larger scales of integration and towards operating modes which use lower supply voltages for reduced power consumption. In this regard, the SOI technology gives the z196 design a robust starting point, and the design is further improved by the stacking of transistors driving critical nodes in the majority of the clocked storage elements [12]. At the circuit level, techniques such as parity checking, residue calculation, checking for illegal states, and local duplication of logic are used to provide a high level of coverage for errors which might occur during operation.

Fig. 7. Recovery Unit (RU) and checkpoint data in the processor core.

Core dataflow elements and registers are close to 100% covered, while in the control logic, all major registers have parity protection. Detailed statistical modeling is used to find and address design weaknesses during the design phase, ensuring that the final design meets all reliability targets. Overall, circuitry for error detection and recovery (including the R-unit, below) is estimated to represent about a 20–25% area overhead on all the core digital logic. The private L1 and L2 caches, along with the interface to the L3 cache, are protected with ECC.
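Residue checking is worth a small worked example. A mod-3 residue travels alongside the data, and the adder result must satisfy the congruence checked below; since a single flipped bit changes the result by ±2^k, and 2^k mod 3 is never 0, every single-bit error in the sum is detected. The code is a behavioral illustration, not the z196 checker.

    def residue(x, m=3):
        return x % m

    def checked_add(a, b, m=3):
        s = a + b                      # the "real" datapath result
        # The check hardware computes this congruence independently.
        assert residue(s, m) == (residue(a, m) + residue(b, m)) % m, \
            "residue check failed: possible hardware error"
        return s

    print(checked_add(1234, 5678))     # passes; a flipped bit in s would trip it

The appeal of residue codes is cost: the checker is a narrow mod-3 datapath shadowing a 64-bit adder, rather than a full duplicate.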

Hard failures are a more obvious concern, and the z196 local clocking scheme incorporates features to allow stressing potential race conditions during test, to make sure adequate margin exists through end of life [12]. In addition, the physical design incorporates a high degree of redundancy for contacts and vias.

To enable a smooth recovery in the event that an error is detected, the system state is checkpointed and monitored by a Recovery Unit (RU), as shown in Fig. 7. The out-of-order execution increases the complexity of this checkpointing. Register renaming, as mentioned in the core logic overview above, is used on the General Purpose Registers (GPRs), the Floating-Point Registers (FPRs), and the Access Registers (ARs), all of which are checkpointed outside the RU, as is the Mapper status, which controls the logical-to-physical address mapping. The RU controls and synchronizes this checkpointing with a network of error signals. The rest of the system state is checkpointed within the RU, in a set of registers (labeled MCR in Fig. 7). All checkpointed state is protected either by ECC or by parity and duplication, so that any single-bit error can be corrected.

When an error is detected, the RU starts by draining the store queue and fencing the processor off from the L3 cache. All execution logic, registers, and register files are reset (except for checkpointed data), the system state is refreshed from the checkpointed data, and the processor is restarted by refetching the next instruction from memory. In the event that the error persists and recovery is not possible, the core is assumed to have permanent damage and enters a checkstop state.


Each core unit containing checkpointed data has its own serial communication satellite, from which the data are read out and ported to a system-level spare core, which then restarts from where the error was first detected.

Outside the cores, the data into and out of the L3 and L4 DRAM caches are protected by in-line, single-error-correction/double-error-detection (SEC/DED) error correction codes (ECC). The corresponding directories, constructed from SRAMs, are also protected by in-line ECC. The L3 and L4 caches and directories are further protected with line purge, the process of evacuating a physical cache location on the detection of a correctable error (CE). After a line has been purged, a new line can be stored in that same location. If a physical line location has been purged twice in a row, the line is then purged and deleted; a line delete ensures that the location is not used again. On the next IML (initial machine load), a delete will be converted to an array repair.
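The SEC/DED principle can be shown at toy scale: a Hamming(7,4) code plus an overall parity bit corrects any single-bit error and detects any double-bit error. The z196 in-line codes protect far wider words, but behave the same way.

    # Minimal SEC/DED sketch over 4 data bits; illustrative only.
    def encode(d):                     # d: list of 4 data bits
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p3 = d[1] ^ d[2] ^ d[3]
        code = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # Hamming positions 1..7
        overall = 0
        for b in code:
            overall ^= b               # overall parity enables DED
        return code + [overall]

    def decode(code):
        c = code[:7]
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]       # checks positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]       # checks positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]       # checks positions 4,5,6,7
        syndrome = s1 + 2 * s2 + 4 * s3      # 1-based position of the error
        overall = 0
        for b in code:
            overall ^= b
        if syndrome and overall:             # single error: correct in place
            c[syndrome - 1] ^= 1
        elif syndrome and not overall:       # two errors: detect, cannot fix
            raise ValueError("double-bit error detected")
        return [c[2], c[4], c[5], c[6]]      # recovered data bits

    word = encode([1, 0, 1, 1])
    word[4] ^= 1                             # inject a single-bit error
    print(decode(word))                      # -> [1, 0, 1, 1], corrected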

All data, address, and command buses are protected by in-line ECC throughout the whole Synchronous Multi-processor (SMP) system. This includes all paths from the L3 to the L4 levels of cache, and from the L4 cache to memory as well. Fabric interfaces from node to node are also in-line ECC protected. There are two spare lanes for each on-module bus, so that failing lanes can be spared out as needed. Firmware monitors any ECC errors and proactively calls home if there are any overflows of spare or cache-repair resources.

The z196 memory is implemented with an innovative redundant array of independent memory (RAIM), a RAID-like scheme applied to memory DIMM channels. DIMMs are populated in groups of five at a time. They can be cascaded into a second DIMM cascade by chaining from one DIMM to the next in each of the five channels. Cascading allows the capacity of the system to reach 3 TB, which includes HSA space and customer space. The RAIM channels are fully protected by ECC. Using five channels instead of four, along with ECC, allows the data from memory to be corrected in the event of the failure of any single channel. This can be done either with channel marking (a way to identify the bad channel) or through the ECC discovery/correction process itself.
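The erasure-recovery idea behind RAIM can be illustrated with plain XOR parity across five channels, as in the sketch below: any one marked-bad channel is rebuilt from the other four. The actual design uses ECC across the channels (and, per the text, can locate the failing channel itself), so this is an analogy for the recovery math, not the implementation.

    def write_stripe(data4):                     # four data words
        parity = 0
        for w in data4:
            parity ^= w
        return data4 + [parity]                  # fifth channel holds parity

    def read_stripe(channels, bad=None):
        """channels: the 5 words as read; 'bad' marks a known-failed channel."""
        if bad is None:
            return channels[:4]
        rebuilt = 0
        for i, w in enumerate(channels):
            if i != bad:
                rebuilt ^= w                     # XOR of the surviving channels
        fixed = list(channels)
        fixed[bad] = rebuilt
        return fixed[:4]

    stripe = write_stripe([0xDEAD, 0xBEEF, 0x1234, 0x5678])
    stripe[2] = 0x0000                           # channel 2 fails in the field
    print([hex(w) for w in read_stripe(stripe, bad=2)])   # data fully recovered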

The memory buses are protected by CRC and use a tiered recovery system. If there are any low-frequency, intermittent errors within a channel, the bus CRC will detect this and use the CRC indicator to force RAIM correction of the channel. This is the first tier, or Tier-1, recovery. It allows for soft-error recovery such that memory operations can continue unaffected. If the lane errors appear at a high rate (or if a solid error appears), the second recovery tier, Tier-2, is invoked, which self-repairs bad data lanes by switching over to a spare lane to transmit the data. If the bus errors are caused by a bus-clock failure, the third recovery tier, Tier-3, is activated, replacing the bad clock with an alternate clock. During Tier-3 recovery, the bad channel is taken offline while the clock is being repaired, meaning the other four channels continue to run in a RAIM-degraded mode. Following Tier-3, there is a fast scrub clean-up phase to bring the data in the repaired channel back in sync with the other four channels. Finally, there is also repair capability for intermittent clock and data errors.

D. Processor Energy Efficiency and Power Analysis

A persistent focus on power reduction was critical to the success of the z196 project. Design tradeoffs between processor speed, performance, and energy consumption were considered during the project concept phase and practiced through the detailed implementation stages on all aspects of the processor design. A comprehensive power methodology [13] was developed to calculate leakage and dynamic power dissipation for various workloads and to track power-reduction progress. As a result of this effort, the energy efficiency of the processor chip was improved by 25% compared to its predecessor [13]–[15], enabling the z196 system to stay within the same power envelope as the previous design, while the system capacity per watt was improved by more than 70% [16]. These improvements are described in the sections below.

Energy consumption work started with chip voltage-domain planning. The processor chip has multiple supply voltage rails, with a primary rail distributed across the entire chip and supplying all logic devices. All SRAM and DRAM cells receive a second rail, commonly referred to as the array supply ("Vcs"). These two main supplies are broken into multiple domains at the chip level (as described in the next section), but these separate domains are tied together at the package level. The logic and array supply rails are unique for each processor chip on a given multi-chip module (MCM), which enables chip-specific voltage selection to minimize the overall MCM power consumption at the required frequency. All rails are designed with minimum resistive power consumption in their delivery paths, including on- and off-chip distribution structures.

The use of on-chip DRAM for the L3 cache provided a significant improvement in cache power efficiency while simultaneously increasing bandwidth and performance [17]. For an equivalent chip area, DRAM reduces leakage power by 60% at triple the cache capacity compared to an SRAM implementation. This reduced the total chip power by about 4%.

In addition to the global chip clock-grid improvements described in the previous section, fine-grained local clock gating was implemented throughout the design, with average clock gating for the processor core exceeding 60% for typical performance workloads. For typical workloads, dynamic clock gating reduced the overall logic switching power by 10%. Improvements in clock-gating efficiency saved about 4% of overall chip power compared to the previous design.

Reduction of switching-induced noise on the processor logic and array supply rails was another method used to significantly improve the chip energy efficiency. A noise reduction of about 50% was obtained compared to the previous design, by the addition of large amounts of on-chip decoupling capacitance, as described in a later section. This reduction in supply noise led to about a 15% overall chip power reduction at the targeted frequency, by enabling the chip to run at a lower supply-voltage set point.

The modeled power breakdown for a nominal chip running a typical workload is shown in Fig. 8. Fig. 8(a) shows the total chip power breakdown by function, where it can be noted that about 70% of the total power is consumed by the 4 cores and core clock grids. Fig. 8(b) shows the chip leakage and active power components by supply rail.


Fig. 8. Chip power breakdowns (a) by component, (b) active vs. leakage by rail.

Fig. 9. Measured chip power (a) vs. process speed and (b) vs. workload.

The total leakage of a nominal chip running a typical workload is about 30% of the total chip power.

Fig. 9 shows the measured chip power data. Fig. 9(a) shows relative processor power measured in an actual system setting running a maximum-activity workload at 5.2 GHz. For each processor chip, the logic and array supply voltages were adjusted according to the process speed, to meet the required frequency specification and to minimize the chip power. The chip power is almost constant across the process-speed range as shown, indicating that the AC-DC power ratio is relatively optimal. Faster, leakier chips are able to run at supply voltages low enough to make up for their increased leakage, while slower chips have a large enough reduction in DC power to compensate for the increased AC power at the higher supply voltage needed. Fig. 9(b) shows the z196 processor power variation as a function of workload. Max power was characterized with a workload engineered to produce the maximum core activity, while typical power reflects actual workload expectations. Maximum chip power averaged about 260 W, although the system power constraint was imposed at the multi-chip-module level (i.e., for a collection of processor chips) rather than at the individual processor chip level. The power with a typical workload was about 85% of the max-power number. The additional power reduction from the Quasi-Idle to the final Idle state is achieved by invoking a power-saving mechanism implemented via millicode control [13].
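The flat power-versus-process-speed behavior can be reproduced with a deliberately crude model: dynamic power grows as V², leakage grows with the chip's process-dependent leakage factor and shrinks with voltage, and each chip runs at the lowest voltage that meets 5.2 GHz. Every constant below is invented purely to illustrate the tradeoff the text describes; none comes from the paper.

    def chip_power(v, leak_factor):
        dynamic = 120.0 * v**2            # ~C*V^2*f, normalized to watts
        leakage = 40.0 * leak_factor * v  # leaky (fast) chips: larger factor
        return dynamic + leakage

    # (minimum voltage meeting 5.2 GHz, relative leakage) per process corner:
    corners = {'slow': (1.15, 0.6), 'nominal': (1.05, 1.0), 'fast': (0.95, 1.8)}
    for name, (vmin, leak) in corners.items():
        print(name, round(chip_power(vmin, leak), 1), 'W')   # roughly flat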

E. Chip IR Drop Analysis

Given the relatively high power densities expected over parts of the chip, especially within the high-frequency processor cores, detailed modeling, analysis, and optimization of the power grid were essential to the success of the project and to the ability to operate at frequencies in excess of 5 GHz. A full-chip DC power-grid analysis was carried out, focused on identification of possible voltage-drop (IR) issues resulting from high-power-density areas or from discontinuities in the grid. In addition, the current through power wires and via structures was studied carefully, in order to ensure that the design met a stringent set of electromigration (EM) reliability requirements.

The z196 chip provided unique challenges for IR/EM evaluation due to the complexity of the power grids, the size of the die, and the local power densities expected. Fig. 10 shows a floorplan with the associated voltage-rail names. There are a total of 15 logic and I/O voltage islands, some of which contain multiple supplies to support embedded SRAM or DRAM arrays. However, the core/L2 islands are identical from C4 to M1 across all 4 instances, reducing the number of unique voltage islands to nine. This simplified the chip IR/EM evaluation by allowing the effort to focus on only one core/L2 instance. Also, due to the high-power-density regions in the core, much of the analysis and iteration focused on the core alone. Thus the core analysis could be carried out independently from that of the nest, thereby speeding the identification of hot spots and the provision of fixes in the form of extra wiring or circuit power-down schemes.

The power-grid analysis was carried out using an in-house tool, ALSIM [18], integrating resistance extraction, electrical analysis, and reporting steps into a single tool framework. Load-current estimation was performed at the macro level by CPAM [19], a vector-based transistor-level simulator. The outputs from CPAM include detailed current measurements analyzed at the first wiring via level (V1), along with the x and y coordinates of the V1 vias, referenced to the macro origin. This data was merged into the ALSIM data structure for subsequent analysis.
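At its core, a DC IR-drop computation solves a resistive network with the macro load currents attached: form the nodal conductance matrix G and solve G v = i for the drop at every node. The toy three-node chain below (invented values; ALSIM and CPAM of course operate on full-chip extracted grids) shows the mechanics.

    import numpy as np

    # Node 0 ties to the C4 bump through 10 mOhm; nodes chain through
    # 5 mOhm segments; nodes 1 and 2 each sink 2 A of load current.
    g_c4, g_seg = 1 / 0.010, 1 / 0.005
    G = np.array([[g_c4 + g_seg, -g_seg,        0.0   ],
                  [-g_seg,        2 * g_seg,   -g_seg ],
                  [0.0,          -g_seg,        g_seg ]])
    i_load = np.array([0.0, 2.0, 2.0])           # amps drawn at each node

    v_drop = np.linalg.solve(G, i_load)           # IR drop below the C4 supply
    print(["%.1f mV" % (v * 1e3) for v in v_drop])   # -> 40.0, 60.0, 70.0 mV

The farthest node in this toy chain already violates a 40 mV budget, which is exactly the kind of hot spot the full-chip analysis is built to flag.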

Based on early analysis, IR targets for the z196 chip were set at 3% of the supply voltage, or approximately 40 mV total power-ground offset (20 mV per rail). Fig. 11 shows typical voltage contour maps of the logic voltage rails for the chip under a maximum-workload type of scenario.

F. Chip Power Supply Noise

Given the high power densities encountered in the z196 design, power-supply noise, especially in the core regions, was of particular concern and needed to be minimized in order to enable the highest possible chip frequency. The IBM 45 nm technology offered an efficient means of mitigating these concerns. A benefit of the deep-trench process, required for the on-chip L3 eDRAM cache, is that a deep-trench-based on-chip capacitor cell was available for supply-voltage decoupling. The capacitance density of the deep-trench structure provides a much higher total die decoupling capacitance than that achievable with the same amount of area devoted to planar decap structures. In comparison to the previous-generation design, which had an overall die capacitance of around 1 μF, the z196 processor chip has a total die capacitance of more than 15 μF, which is heavily leveraged to minimize on-chip power-supply noise.


Fig. 10. z196 voltage islands.

Fig. 11. Chip IR drop in max workload case. (a) Core. (b) Nest.


Power-supply noise on the combined SRAM and DRAM power domain proved to be of particular concern. The DRAM modules generate regular current bursts of several amperes in magnitude, caused by the periodic refresh operation. This shows up as a regular pattern of voltage spikes on the DRAM/SRAM chip voltage supply when the system is in idle mode. Fig. 12(a), upper trace, shows this pattern measured on early z196 hardware, which included approximately 265 nF of on-chip capacitance on the combined DRAM/SRAM voltage rail. Each DRAM refresh led to a drop in supply voltage of almost 75 mV.

During periods of heavy L3 cache utilization, the DRAM currents are even larger and very irregular. Fig. 12(b), upper trace, shows measurements of the DRAM/SRAM supply on early z196 hardware during a phase of heavy system cache utilization. Peak-to-peak noise was observed to be more than 300 mV, which exceeded the system noise specification limits. Therefore, later hardware included approximately 1.5 μF of additional capacitance on the DRAM/SRAM voltage supply, particularly in areas close to the DRAM modules.


Fig. 12. Array supply noise, early vs. later hardware. (a) Idle. (b) Heavy DRAM access.

The lower traces of Fig. 12(a) and (b) show the dramatic voltage-stability improvements resulting from this capacitance increase: the DRAM/SRAM power noise during functional exercisers decreased by more than 60%, while the refresh noise observed in idle mode dropped by almost a factor of five.

G. Chip Frequency Tuning

To get the final increment of performance from the design, various clock-control switches [3] were tuned so as to optimize the final product frequency. Three clock-control bits were used to drive to a more aggressive product frequency. The first such switch was used to disable clock gating. Some of the most critical timing paths in the design were the logic paths needed to disable clocks when logic was not being used. If this clock gating was done for functional reasons, it could not be disabled. However, one of the primary uses of clock gating was to save power when areas of logic were not being used. Each individual macro had its own programmable clock-gate override, so that if a gating signal to a particular macro was found to be limiting chip frequency, it could be disabled with only a negligible power penalty.

The second control setting would delay the clock-pulse arrival at a latch. The majority of the clocked storage elements used self-timed clock pulses to write the data [12], so that delaying the pulse would do one of two things. It could "aggravate" the paths launched out of a critical macro, resulting in a lower failing frequency because of the extra delay at the beginning of the cycle; this mode was particularly useful for critical-path debug. It could also "alleviate" paths captured by a critical macro, raising the failing frequency by giving extra time to capture the data at the end of the cycle. Although this mode was also useful for lab debug, it was not used in the final product for timing-path alleviation, because the amount of delay added by this switch would often cause other logic paths to fail.

Fig. 13. Clock duty cycle adjustment.


The third switch would delay only the trailing edge of the clock pulse, allowing a critical data signal to arrive later (stealing time from the next cycle). Since there was no chance of negatively impacting product frequency with this switch, it was used in specific macros to improve the final product frequency. The increased risk of hold-time violations was mitigated by extensive testing over a wide PVT range.

The use of these switches provided diminishing returns, as the variety and number of timing paths unveiled increased with increasing chip frequency. Once it had been determined that there was no longer a benefit to adding further control settings to the product, the final set of clock-control bits was added to the product engineering data.

Overall product frequency was also improved via global clock duty-cycle optimization. As the chip frequency was aggressively pushed to the limit, critical half-cycle paths were identified in register files and SRAM arrays that could be improved by altering the clock duty cycle. Although this sort of adjustment could not be done on a macro-by-macro basis, each chip could be assigned its own unique duty-cycle setting, targeted to maximize the clock margins and/or the chip frequency. The clock duty cycle was measured via "skitter" test structures [20] on each chip of a multi-chip module, and if it was found to be below a certain threshold on any given chip, a duty-cycle adjustment was programmed for that chip. Fig. 13 shows the range of measured duty cycles on all chips, and the threshold that was used to determine whether an adjustment was needed. Core-specific adjustments were also possible, but this capability was found not to be needed.

The hardware-based chip frequency tuning provided the final increment of performance to the design, pushing the product frequency up by about 1.5% under a constant-power constraint.

V. CONCLUSIONS

The z196 chip was designed to push frequency and performance to new limits, using unique features of the 45 nm technology, tools and methods optimized for high-frequency design, and a variety of special circuits in key critical areas. At the chip level, detailed attention to wiring solutions, and a focus on power, IR drop, and supply noise, were all key enablers of this high-performance design.


These design goals were met while simultaneously adding new RAS features to the design, starting from the base circuit level and proceeding up to the system level, all without sacrificing the RAS features introduced in previous designs. This combination of industry-leading RAS and unprecedented chip frequency allows the zEnterprise system to reach new levels of performance and stability for the most critical workloads.

ACKNOWLEDGMENT

The authors would like to acknowledge and thank the many talented engineers who contributed to the success of the design, as well as the rest of the global System z design team. In addition, the research work, development efforts, and tireless support of IBM's EDA team are gratefully acknowledged. Finally, the authors would like to thank the technology research, development, and manufacturing teams for making the fabrication of this chip possible.

REFERENCES

[1] J. Warnock et al., "A 5.2 GHz microprocessor chip for the IBM zEnterprise system," in IEEE ISSCC Dig. Tech. Papers, 2011, p. 70.

[2] S. Narasimha et al., "High performance 45-nm SOI technology with enhanced strain, porous low-k BEOL, and immersion lithography," in IEDM Dig. Tech. Papers, 2006, p. 100.

[3] D. Wendel et al., "POWER7, a highly parallel, scalable multi-core high-end server processor," IEEE J. Solid-State Circuits, vol. 46, no. 1, p. 145, 2011.

[4] B. Curran, "The next-generation system z micro-processor," presented at Hot Chips 22, Stanford, CA, 2010.

[5] B. Curran et al., "The zEnterprise 196 system and microprocessor," IEEE Micro, vol. 31, no. 2, p. 26, 2011.

[6] A. R. Conn et al., "Gradient-based optimization of custom circuits using a static-timing formulation," in Proc. 36th ACM/IEEE Design Automation Conf., 1999, p. 452.

[7] B. Curran et al., "Power-constrained high-frequency circuits for the IBM Power6 microprocessor," IBM J. Res. & Dev., vol. 51, no. 6, p. 721, 2007.

[8] K. Zhang et al., "The scaling of data sensing schemes for high speed cache design in sub-0.18 μm technologies," in Symp. VLSI Circuits Dig., 2000, p. 226.

[9] A. Pelella et al., "Dynamic hit logic with embedded 8 Kb SRAM in 45 nm SOI for the zEnterprise processor," in IEEE ISSCC Dig. Tech. Papers, 2011, p. 72.

[10] A. Pelella et al., "A 8 K domino read SRAM with hit logic and parity checker," in Proc. ESSCIRC, 2005, p. 359.

[11] J. L. Neves et al., "eFinale—Integration platform for high performance processor design," in Proc. 48th ACM/IEEE Design Automation Conf., Jun. 2011, to be published.

[12] J. Warnock et al., "POWER7 local clocking and clocked storage elements," in IEEE ISSCC Dig. Tech. Papers, 2010, p. 178.

[13] H. Wen et al., "IBM zEnterprise energy efficient 5.2 GHz processor chip," presented at ICICDT 2011, Kaohsiung, Taiwan.

[14] C. K. Shum et al., "Design and microarchitecture of the IBM System z10 microprocessor," IBM J. Res. & Dev., vol. 53, no. 1, 2008.

[15] C. F. Webb, "IBM z10: The next-generation mainframe microprocessor," IEEE Micro, vol. 28, no. 2, p. 19, 2008.

[16] M. Andres et al., "zEnterprise energy efficiency and energy management improvements," IBM J. Res. & Dev., 2012, to be published.

[17] J. Barth et al., "A 45 nm SOI embedded DRAM macro for the POWER7 processor 32 MByte on-chip L3 cache," in IEEE ISSCC Dig. Tech. Papers, 2010, p. 342.

[18] S. R. Nassif et al., "Fast power grid simulation," in Proc. 37th ACM/IEEE Design Automation Conf., 2000, p. 156.

[19] J. S. Neely et al., "CPAM: A common power analysis methodology for high performance VLSI design," in Proc. IEEE Conf. Electrical Performance of Electronic Packaging, Scottsdale, AZ, 2000, p. 303.

[20] R. Franch et al., "On-chip timing uncertainty measurements on IBM microprocessors," in Proc. IEEE Int. Test Conf., Anaheim, CA, 2007, paper 1.1, p. 1.

James Warnock (SM'06) received the B.Sc. degree from Ottawa University, Ottawa, ON, Canada, and the Ph.D. degree in physics from the Massachusetts Institute of Technology, Cambridge, MA.

Since then, he has been at IBM in Yorktown Heights, NY, working on high-speed microprocessors including IBM's S/390 G4, POWER4, the Cell Broadband Engine, POWER7, and the zEnterprise 196. His interests include VLSI circuit design tools and methodology, clocked storage elements, design for test, and design-technology interactions.

Dr. Warnock is a Distinguished Engineer in IBM's Systems and Technology Group and a member of the IBM Academy of Technology.

Yiu-Hing Chan received the B.E.E. degree from City College of New York, NY, in 1977 and the M.S.E.E. degree from Syracuse University, Syracuse, NY, in 1983.

In 1977, he joined IBM in Endicott, NY, where he worked on circuit and chip design for impact printers and mainframe systems. In 1994, he joined the CMOS microprocessor design team in Poughkeepsie, NY, where he has been involved in the development of several zSeries servers and Power6 microprocessors.

Sean Carey received the B.S. degree in electrical engineering from Clarkson University, Potsdam, NY, and the M.S. degree in electrical engineering from Syracuse University, Syracuse, NY.

He has led several generations of System-Z processor designs in the areas of frequency optimization and timing methodology development. He has also led the hardware characterization and analysis efforts for these designs, focusing on frequency/power and yield from the lab all the way through manufacturing. He is a Senior Technical Staff Member whose focus is on power-efficient high-frequency design and methodology as well as hardware analysis. He is currently heading the system characterization effort of one of System-z's future designs.

Huajun Wen (M'00) received the Ph.D. degree in solid-state physics from the Free University Berlin, Germany.

She is an IBM Senior Technical Staff Member in IBM's Systems and Technology Group. She is currently responsible for energy efficiency and power performance optimization of the IBM System z processor family. She joined IBM's Research Division in Yorktown Heights in 1995, where she studied MOSFET gate-oxide reliability as a physics researcher. During 1998–2007, she held various circuit leadership roles on IBM Power processor design projects as well as the Sony-Toshiba-IBM Cell BE design project, in Austin, TX. Her expertise spans VLSI circuit design and its interactions with CMOS process, low-power design techniques, chip power estimation/characterization methodology, and chip/package noise reduction techniques. She holds many U.S. patents and has published numerous technical papers.

Pat Meaney received the B.S. degree in electrical and computer engineering from Clarkson University, Potsdam, NY, in 1986, and the M.S. degree in computer engineering from Syracuse University, Syracuse, NY, in 1991.

He is a Senior Technical Staff Member and Master Inventor at IBM, serving as the System z Memory Subsystem Leader. He was recently responsible for architecting and delivering the RAIM memory design on the z196 system. Since joining IBM Poughkeepsie in 1986, he has held design, timing, and RAS leadership positions on the ES/9021 bipolar-based machines as well as the S/390/zSeries G4, G5, G6, z900, z990, z9, z10, and z196/zEnterprise CMOS systems.

Mr. Meaney holds 46 U.S. patents and has several more pending. He has received several awards, including Outstanding Technical Achievement Awards for the H5, G4, G6, and z900 designs, an Outstanding Innovation Award for the G5 design, and a Corporate Award for Memory RAIM on z196.

Guenter Gerwig (M’03) received the B.S. and M.S. degrees in electrical engineering from the University of Stuttgart, Stuttgart, Germany, in 1978 and 1981, respectively.

After joining IBM in 1981, he focused on floating-point design, serving as team leader on several CMOS mainframe processors. More recently, he has been a core lead on mainframe designs, focusing on recovery logic and process.

Mr. Gerwig received an IBM Invention Achievement Award and holds more than 20 patents.

Howard H. Smith received the B.S. and M.S. degrees in electrical engineering from the New Jersey Institute of Technology (NJIT), Newark, NJ, in 1984 and 1985, respectively.

He joined IBM in 1984 as an integrated circuit engineer at its semiconductor development laboratory in Fishkill, NY, working in the area of high-performance gate array designs. He is currently a Senior Engineer in IBM’s Systems and Technology Group at Poughkeepsie, NY, where he is responsible for electrical analysis issues associated with high-density CMOS circuit and chip technologies. His recent assignments include the development of on-chip noise and power grid verification processes for IBM processor chips. His expertise lies in the area of electrical noise modeling, analysis, and prediction at both circuit and system levels of operation. He has co-authored many papers on noise and power grid analysis methodology issues and solutions, and he also holds many patents in this area. He looks forward to challenges associated with future chip technologies and associated multi-core server chip designs.

Yuen Chan (M’77) received the B.S. degree in electrical engineering from Union College, Schenectady, NY, and the M.S.E.E. degree from Syracuse University, Syracuse, NY.

He is a Senior Technical Staff Member at IBM working on custom circuit and SRAM design. He has over 30 years of experience in high-performance Bipolar, BiCMOS, and CMOS array development. He is currently the Z-Server high-speed array lead in the IBM Systems and Technology Group, supporting both zSeries and POWER processor applications.

Mr. Chan is an IBM Master Inventor and holds over 50 circuit patents.

John Davis received the B.S.E.E. degree from the New Jersey Institute of Technology, Newark, NJ, in 1984.

He joined IBM East Fishkill in 1984. During his career with IBM, he has designed circuits in Bipolar, BiCMOS, and CMOS technologies. He is currently an SRAM team leader in the IBM server group.

Paul Bunce received the B.S.E.E. degree from the New Jersey Institute of Technology, Newark, NJ, in 1984, and the M.S.E.E. degree from Syracuse University, Syracuse, NY, in 1992.

He joined IBM in East Fishkill, NY, in 1984. During his IBM career, he has designed circuits in Bipolar, BiCMOS, and CMOS technologies. He is currently an SRAM design leader in the IBM Systems and Technology Group.

Antonio Pelella received the B.S.E.E. degree in 1983 and the M.S.E.E. degree in 1985 from Clarkson University, Potsdam, NY.

He is a Senior Engineer in IBM’s Systems and Technology Group in Poughkeepsie, NY. He has 26 years of experience with SRAM/hit logic in Bipolar, BiCMOS, bulk CMOS, and SOI CMOS technologies. He has been the principal designer for L1 caches, TLBs, directories, stand-alone “hit logic” macros, and SRAM macros with built-in hit logic and parity checkers. He holds 25 U.S. patents, and has been the principal author and presenter of five papers and co-author of ten papers.

Dan Rodko received the B.E. degree in electrical engineering from The Cooper Union for the Advancement of Science and Art, New York, NY, in 2000, and the M.E. degree with a concentration in microelectronics from the Rensselaer Polytechnic Institute, Troy, NY, in 2006.

He joined IBM Poughkeepsie in 2000. He has worked as an SRAM designer for IBM zSeries machines and as an SRAM BIST designer, driving ABIST implementation on all IBM Systems and Technology Group projects, including pSeries and zSeries. His latest interest is driving SRAM hardware characterization on the system test floor.

Pradip Patel received the M.S. degree from the University of Illinois, Urbana, IL, in 1987.

He joined IBM in East Fishkill in 1987. He has worked on array built-in self-test for IBM System Z and P processors. He is currently leading the SRAM ABIST design for future POWER and Z next-generation processors.

Thomas Strach received the M.S. degree from McMaster University, Hamilton, ON, Canada, in 1992, and the Ph.D. degree in physics from the University of Stuttgart, Stuttgart, Germany, in 1997.

He joined IBM in 1997 and started working for the IBM Systems and Technology Group in 2004 as a staff engineer. His main responsibility is the power decoupling design of IBM high-end servers. His current fields of interest include the simulation and experimental verification of on-chip power noise propagation as well as the design of guidelines for the effective placement of on-chip decoupling cells.

Doug Malone received the B.S. degree in physics from Manhattan College, New York, NY, and the M.S.E.E. degree from Syracuse University, Syracuse, NY.

He is a Senior Engineer and has held a number of technical and managerial positions in his 30 years with IBM. He has worked in chip integration and SRAM development and test for the IBM Z- and P-processors. Currently, he is the high-frequency clock distribution team leader for the Z-processor family.

Frank Malgioglio received the B.E.E. degree from Manhattan College, New York, NY, in 1989, and the M.S. degree in electrical engineering from Syracuse University, Syracuse, NY, in 1994.

He joined IBM in 1989. He is currently a Senior Engineer with IBM, Poughkeepsie, NY, and was the physical architect of the Z9 microprocessor, responsible for chip assembly, electrical integrity, tools/methodology, design verification, as well as release into manufacturing. He has performed this role for several generations of Z mainframe processors and continues to drive the physical chip integration and methodology for Z-series chips as well as Power Series chips.

José Luis Neves (A’11) received the B.Sc. degree in electrical engineering and the M.S. degree in computer science from the Federal University of Minas Gerais (UFMG), Brazil, and the Ph.D. degree in electrical engineering from the University of Rochester, Rochester, NY.

He joined IBM in 1986, working in tools and methodology development for ASICs and high-speed microprocessors. Since 2005, he has worked on several microprocessors including the z10 and zEnterprise 196. Since 2006, he has led the development of eFinale, an integration design platform for timing, power, and wiring closure. His interests include timing and low-power optimization techniques, EDA tool development and methodology, parallel processing, and microprocessor design. He has written 27 papers and holds 18 patents.

Dr. Neves is a Senior Engineer in IBM’s Systems and Technology Group.

David L. Rude received the B.S.E. degree from the University of Central Florida, Union Park, FL, in 1978, and the M.S.E.E. degree from Syracuse University, Syracuse, NY, in 1997.

Since joining IBM in 1978, he has held positions in development and manufacturing engineering at Kingston and Poughkeepsie, NY. He has worked in gas panel display development, energy conservation, electronic test equipment for large circuit boards, timed/statistical coupled-noise methods, and CMOS circuit design and integration. Recently, he has been a circuit lead for System z designs.

William Huott received the B.S. degree in electrical engineering from Syracuse University, Syracuse, NY, in 1984.

He has been with IBM for 27 years and is currently a Distinguished Engineer at IBM Poughkeepsie, leading the IBM Server Group Design for Test (DFT) strategy. He is a Master Inventor at IBM, with numerous patents and published articles.