High Performance Processors in a Power Limited World
Sam Naffziger, AMD Senior Fellow
Outline
• Today's processor design landscape: Trends
• Issues making designers' lives difficult: Power limits, Scaling effects
• Design opportunities: Circuit level, Architectural
• Summary
The All-Consuming Quest for Greater Performance at Lower Cost
Increasing Transistor Density
Moore’s Law has served us well.
Increasing Performance
Processor Frequency vs. Time
The amazing frequency increases of the past decade have leveled off – Why?
[Chart: MPU performance (MHz) vs. time, Jan-97 to Jan-09, log scale from 100 to 10000; frequency levels off near 4 GHz, bounded by power limits and process issues]
Power Consumption Background
Source: Holt, HotChips 17
Power has always challenged circuit integration
Bipolar → NMOS → CMOS
We’ve been bailed out by technology in the past
Scaling Background
2005 ITRS Projections of Vt and Vdd
[Chart: Vt and Vdd (mV, 0 to 2500) projected over 2000 to 2010]
Source: Horowitz et al, IEDM 2005
[Chart: Dennard scaling: feature size (µm), Vdd, and Power/10 vs. time, Jan-85 to Jan-03, log scale]
Scaling doesn’t bail us out any more
Realistic power limit
The Processor Designer: Process Scaling Issues
Power Consumption Background
P ≈ C_TOT·α·F·Vdd² + N_TOT·α·F·Vdd·I_CO + N_ON·I_LEAK·Vdd
    (switching power)    (crossover power)     (leakage power)
• Reducing Vdd
• Reducing C_TOT
• Reducing I_LEAK, I_CO
• Reducing α
The process guys have had the biggest impact on these.
But now, not only are those improvements fading, we also face a host of new challenges:
• Variation
• Voltage droop
• Wire non-scaling
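The power equation above can be explored numerically. A minimal sketch, where every parameter value is hypothetical and chosen only to illustrate the relative size of the three components:

```python
# Illustrative sketch of the slide's power equation (all parameter values
# are invented for illustration, not measured from any real processor):
#   P ≈ C_tot·α·F·Vdd² + N_tot·α·F·Vdd·I_co + N_on·I_leak·Vdd

def processor_power(c_tot, n_tot, n_on, alpha, f, vdd, i_co, i_leak):
    switching = c_tot * alpha * f * vdd**2        # dynamic switching power
    crossover = n_tot * alpha * f * vdd * i_co    # short-circuit (crossover) power
    leakage   = n_on * i_leak * vdd               # static leakage power
    return switching, crossover, leakage

# Hypothetical numbers: 100 nF total switched cap, 2 GHz, 1.2 V,
# 15% activity factor, 100M leaking devices at 50 nA each
sw, co, lk = processor_power(
    c_tot=100e-9, n_tot=50e6, n_on=100e6,
    alpha=0.15, f=2e9, vdd=1.2,
    i_co=1e-16, i_leak=50e-9)

print(f"switching {sw:.1f} W, crossover {co:.1f} W, leakage {lk:.1f} W")
```

Note the Vdd² factor on the switching term: this is why the later slides treat voltage reduction as the highest-leverage knob.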
Scaling Issues
The Silicon Age Still on a Roll, But …
ITRS Roadmap:
Technology Node (nm)        90 → 65 → 45 → 32 → 22 → 16 → 8
High Volume Manufacturing   2004 → 2006 → 2008 → 2010 → 2012 → 2014 → 2018
Delay = CV/I scaling        0.7 → ~0.7 → >0.7 (delay scaling will slow down)
Energy/Logic Op scaling     >0.35 → >0.5 → >0.5 (energy scaling will slow down)
RC Delay                    1 (no improvement)
Variability                 Medium → High → Very High
Bulk Planar CMOS            High probability → Low probability
Alternate, 3G etc.          Low probability → High probability
But … all this scaling has some nasty side effects
[Roadmap: 65nm (2007) → 45nm (2010) → 32nm (2013) → 22nm (2016); bulk → PDSOI → FDSOI; stressors + substrate engineering → high-µ materials; SiON/poly → high-k/metal gate stacks; planar → 3D, with MuGFET/MuCFET for electrostatic control]
Source: European Nanoelectronics Initiative Advisory Council (ENIAC)
Device Variation Reverse Scales
Source: Pelgrom, IEEE lecture 5/11/06
Variations subtract directly from cycle time; power efficiency drops
Circuit margins degrade
The problem: atoms don't scale
One impact of variation is leakage spread
[Chart: standby leakage Sidd (A) vs. Fmax (a.u.): a 1.25× spread in Fmax corresponds to a >3× spread in Sidd]
Note: chip Sidd is set by the "smallest" gates; Fmax is set by the slowest gates.
Scaling Intrinsically Hurts Supply Integrity
[Chart: leakage %, normalized Vdd, and Vdd droop % (0 to 120%) across technology nodes 250nm, 180nm, 130nm, 90nm]
Source: Bose, Hotchips 17
With power per core staying constant but area, voltage, and cycle times dropping, we have a big challenge. Requiring a higher voltage to hit frequency has a quadratic power impact.
Circuit Designer
Some Ways to Shoulder the Variation Burden: Adaptive clocking
Empirically set the clock edge to optimize frequency
Higher granularity → more variation tolerance
LBIST and GA search algorithms show promise for per-part optimization
[Diagram: programmable delay buffers]
Some Ways to Shoulder the Variation Burden: Self Healing Designs
Simplest example is cache ECC on memory arrays
Next level is Intel’s Pellston technology implemented on Montecito and Tulsa
Disable defective lines detected by multiple ECC errors
Future directions involve self-checking with redundant logic and retry
• Predict the result through parity, residues, or redundant logic
• On an error, replay the calculation before committing architectural state
• If the replay is correct, it was a transient error (particle strike, Vdd droop, random noise coupling, etc.)
• If incorrect, reduce frequency, increase voltage, or retry with an alternate execution path
From Fall Microprocessor Forum 2006
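The residue predict-and-replay idea above can be sketched as a toy model (this is not any real processor's implementation; the glitchy adder simulating a transient error is invented for illustration):

```python
# Toy sketch of residue checking with replay. A mod-3 residue of an
# addition is predicted cheaply from the operands; a mismatch with the
# datapath result flags a (possibly transient) error and triggers replay
# before the result is committed. Note a mod-3 check can miss errors
# that happen to preserve the residue.

def add_with_residue_check(a, b, adder=lambda x, y: x + y, max_retries=1):
    for attempt in range(max_retries + 1):
        result = adder(a, b)               # the "datapath" result
        predicted = (a % 3 + b % 3) % 3    # redundant residue prediction
        if result % 3 == predicted:
            return result                  # check passed: commit
        # mismatch: replay the calculation
    raise RuntimeError("persistent error: reduce frequency or raise voltage")

# A faulty adder that flips a bit on its first invocation only,
# mimicking a transient error such as a particle strike
class GlitchyAdder:
    def __init__(self): self.calls = 0
    def __call__(self, x, y):
        self.calls += 1
        s = x + y
        return s ^ 1 if self.calls == 1 else s

print(add_with_residue_check(5, 7, adder=GlitchyAdder()))  # prints 12
```

The first (corrupted) result 13 fails the residue check; the replay returns the correct 12, matching the slide's "if replay correct, it was a transient error" path.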
Adaptive Supply Voltage
[Contour plot: energy/operation vs. channel length (short, nominal, long) and Vdd (low, high)]
Per-part and dynamic voltage management are key
More range flexibility and finer-grained response will provide differentiation
Integrated Power and Thermal Management
“Fuse and forget” is no longer viable
Too much variation in environment, manufacturing, and operating conditions
Some means of dynamic optimization is needed
[Chart: Sidd (A) vs. Fmax (a.u.)]
Integrated Power and Thermal Management
An autonomous programmable controller enables real time optimizations
An embedded controller provides the needed flexibility
• OS interfacing
• Multi-core management
• Per-part optimization
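What such an embedded controller's policy loop might look like can be sketched as follows. This is hypothetical firmware logic: the thresholds and the step-up/step-down policy are invented for illustration, and only the frequency ladder is taken from the P-state table later in the deck; real controllers also integrate OS hints, per-part fused limits, and multi-core state.

```python
# Hypothetical sketch of an embedded power/thermal controller's policy
# loop. Thermal limit overrides performance demand; otherwise frequency
# steps follow utilization.

FREQ_MHZ = [2600, 2400, 2200, 2000, 1800, 1000]   # ladder, highest first

def next_p_state(current_idx, temp_c, utilization,
                 temp_limit=95.0, high_util=0.80, low_util=0.30):
    if temp_c > temp_limit:                 # thermal event: step down
        return min(current_idx + 1, len(FREQ_MHZ) - 1)
    if utilization > high_util:             # demand high: step up
        return max(current_idx - 1, 0)
    if utilization < low_util:              # demand low: save power
        return min(current_idx + 1, len(FREQ_MHZ) - 1)
    return current_idx                      # stay put

# Simulated samples: busy and cool, busy and hot, then nearly idle
idx = 0
for temp, util in [(70, 0.9), (97, 0.9), (80, 0.1)]:
    idx = next_p_state(idx, temp, util)
print(FREQ_MHZ[idx])  # prints 2200
```

The hot sample forces a step down even though utilization is high, which is the "integrated power and thermal management" point of the slide.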
Chip Architect
Traversing the Power Contour
[Contour plots: power consumption and frequency vs. channel length (short, nominal, long) and Vdd (low, high)]
P ≈ C_TOT·α·F·Vdd² + N_TOT·α·F·Vdd·I_CO + N_ON·I_LEAK·Vdd
    (switching power)    (crossover power)     (leakage power)
Traversing the Power Contour for a Given Implementation
[Contour plot: energy/operation vs. channel length and Vdd; max performance at short channel and high Vdd, best power efficiency at long channel and low Vdd]
[Chart: Performance³/Watt (0 to 1.2) vs. channel length (nominal, short, long) and Vdd (low, nominal, high)]
For comparing architectural efficiency, Performance³/W is most effective
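One reason Performance³/W works as an architecture metric: under a simplified model where frequency scales linearly with voltage and dynamic power scales as V³, the metric is independent of the operating voltage, so it compares designs rather than operating points. A quick numeric check of that simplified model (the model and values are illustrative, not from the slides):

```python
# Why Performance^3/W compares architectures rather than operating points:
# with f ∝ V and P_dyn = C·f·V² ∝ C·V³, perf³/W reduces to IPC³/C,
# independent of the chosen voltage. (Simplified first-order model only.)

def perf3_per_watt(ipc, c_eff, v):
    f = v                     # frequency ∝ voltage (simplified)
    perf = ipc * f
    power = c_eff * f * v**2  # dynamic power ∝ V³
    return perf**3 / power

# Same architecture at two operating voltages gives the same metric
m_low = perf3_per_watt(ipc=1.5, c_eff=1.0, v=0.9)
m_high = perf3_per_watt(ipc=1.5, c_eff=1.0, v=1.2)
print(abs(m_low - m_high) < 1e-9)  # prints True
```

A lower exponent (e.g. plain Performance/W) would reward simply running slower, which is why the deck uses the cube.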
• The industry has been moving from "hyper-pipelining" with short pipe stages to something more moderate
[Chart: pipe stages (0 to 35) for Willamette (2000), Prescott (2004), Core 2 (2006), Opteron, and Power6, relative to the optimal pipeline depth; the moderate designs are ~2× more power efficient]
A Look at Mobile Processor Power
A Look at Mobile System Power
Mobile System Power
TDP vs. Average Power
• Rest of system
• Chipset
• Memory controller
• CPU
• Memory
If a laptop burned TDP power all the time, battery life would be measured in minutes
How do we get mobile average power so much lower than TDP?
MobileMark 2002, Tj = 95, 1800 MHz, 1.35 V
[Chart: core power and IO power (W, 0 to 35) vs. time (0 to 70 min) with C1, PowerNow!, and C3 enabled, across Adobe Flash, MS Outlook, Word, PowerPoint, Excel, and Netscape workloads]
The Answer: Take Advantage of Typically Low CPU Utilization
Processor Utilization During Normal Laptop Usage as Approximated by MobileMark
Reducing Power and Cooling Requirements with Processor Performance States
P-State   Frequency   Voltage   Power
P0        2600 MHz    1.40 V    ~95 W
P1        2400 MHz    1.35 V    ~90 W
P2        2200 MHz    1.30 V    ~76 W
P3        2000 MHz    1.25 V    ~65 W
P4        1800 MHz    1.20 V    ~55 W
P5        1000 MHz    1.10 V    ~32 W
(P-state selected HIGH → LOW based on processor utilization)
Up to 75% power savings (at idle)!
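The P-state table can be checked numerically. This sketch treats the slide's "~" wattages as exact just to compute the savings each state offers relative to P0 (the table alone gives about 66% at P5; the quoted "up to 75% at idle" reflects additional idle savings beyond a pure P-state switch):

```python
# P-state table from the slide: (frequency MHz, voltage V, ~power W)
P_STATE_TABLE = {
    "P0": (2600, 1.40, 95), "P1": (2400, 1.35, 90), "P2": (2200, 1.30, 76),
    "P3": (2000, 1.25, 65), "P4": (1800, 1.20, 55), "P5": (1000, 1.10, 32),
}

def savings_vs_p0(state):
    """Fractional power saved by running in `state` instead of P0."""
    p0_watts = P_STATE_TABLE["P0"][2]
    return 1 - P_STATE_TABLE[state][2] / p0_watts

for s, (mhz, v, w) in P_STATE_TABLE.items():
    print(f"{s}: {mhz} MHz @ {v:.2f} V, ~{w} W, {savings_vs_p0(s):.0%} below P0")
```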
Average CPU Core Power (measured at CPU)
[Chart: average core power (W, 0 to 25) with AMD PowerNow!™ disabled vs. enabled: 10,500 connections (~62% CPU utilization) -33%; 5,000 connections (~40% CPU utilization) -62%; idle (in OS) -75%]
Additionally, "C-states" reduce power further by cutting clocks completely and dropping voltage to retention levels
Improving Peak Performance per Watt
[Scatter plot: Watts/(Spec·Vdd²·L) vs. Spec2000·L, log-log scales]
Source: Horowitz et al, IEDM 2005
Adding Features to Increase Performance
• Increasing execution efficiency has historically hurt power efficiency
• However, the cubic reduction of power with V/F scaling has tended to make this a good tradeoff
[Chart: energy/op, frequency, and performance vs. IPC (0.5 to 2.0)]
Adding Features to Increase Performance Works with V/F Scaling
[Chart: energy/op, frequency, and performance vs. IPC (0.5 to 2.0)]
If we hit VMIN, however, the game is over:
• Voltage scaling has its limits
• More power-efficient designs have an advantage
• High-power designs get penalized due to higher di/dt, higher temperatures, etc.
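The tradeoff the last two slides describe can be made concrete. Under the same simplified model as before (f ∝ V, so dynamic power ∝ f³ at fixed capacitance), holding total power constant means frequency only drops as the cube root of the capacitance growth, so an IPC feature wins whenever its IPC gain exceeds the cube root of its energy/op cost. The specific gain and cost numbers below are invented for illustration:

```python
# Sketch of the slide's argument: a feature raises IPC but also switched
# capacitance (energy/op). Holding power constant via V/F scaling
# (P ∝ C·f³ when f ∝ V), frequency drops only as cap_growth^(1/3),
# so net performance = ipc_gain · cap_growth^(-1/3).
# Once at VMIN this trade is no longer available and the feature must
# win on IPC alone.

def net_speedup(ipc_gain, cap_growth):
    freq_scale = cap_growth ** (-1 / 3)   # constant power: f ∝ (1/C)^(1/3)
    return ipc_gain * freq_scale

print(net_speedup(1.20, 1.30) > 1.0)  # 20% IPC for 30% more cap: a win
print(net_speedup(1.05, 1.50) > 1.0)  # 5% IPC for 50% more cap: a loss
```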
How Hard is Improving Existing Processors?
[Scatter plot: Watts/(Spec·Vdd²·L) vs. Spec2000·L, log-log scales]
Source: Horowitz et al, IEDM 2005
Peak performance costs more energy/operation
Most of the big-hitter improvements have been heavily mined already. Next-generation AMD cores have well over 50% of clocks gated off even for high-power code.
Current and Next Generation Core Comparison
[Bar chart: switched capacitance, split into clock and logic, for Gen1 peak vs. Gen2 peak]
Despite more features, the next-generation core has substantially lower switched capacitance
Multi-Core to the Rescue?
Sounds like a great story, what’s the catch?
[Diagram: baseline cores + cache (Voltage = 1, Frequency = 1, Area = 1, Power = 1, Perf = 1, Perf/Watt = 1) vs. double the cores + cache (Voltage = 0.85, Frequency = 0.85, Area = 2, Power = 1, Perf ≈ 1.7, Perf/Watt ≈ 1.7)]
Multi-Core to the Rescue?
Some of the catches:
What if you're already at VMIN? Then you need to cut frequency in half to stay within the power limit
How much parallelizable code is really out there?
More compute capacity means more IO and memory bandwidth demands …
Multi-Core Issues: Amdahl’s Law
There is almost always a portion of an application that cannot be parallelized
This portion becomes a bottleneck as the number of threads is increased
A typical value is in the range of 10%
[Chart: multi-core speedup vs. cores (1 to 8) with serial code and constant power considered, for serial fractions of 0, 5%, 10%, 15%, 20%, plus the ideal case]
Just 10% serial code drops the 8-core performance improvement by 41%
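The 41% figure follows directly from Amdahl's law (the chart also folds in constant-power scaling, but the quoted number matches the plain formula):

```python
# Amdahl's law check of the slide's claim: 10% serial code at 8 cores.

def amdahl_speedup(serial_fraction, cores):
    """Speedup when only the parallel fraction scales with core count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

s = amdahl_speedup(0.10, 8)
print(round(s, 2))          # prints 4.71, instead of the ideal 8
print(round(1 - s / 8, 2))  # prints 0.41, the quoted 41% drop
```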
Multi-Core Issues: IO Power
All those extra cores need their own data …
IO power has been pretty constant for years, in the range of 20 mW per Gb/s
• If we increase IO power accordingly but hold total chip power constant with V/F scaling, things get worse
• Overall performance drops by another 10% or so …
[Chart: multi-core speedup vs. cores (1 to 8) with serial code, constant power, and IO power considered, for serial fractions of 0, 5%, 10%, 15%, 20%, plus the ideal case]
The Transition to Parallel Applications
Parallel Applications
Small number of applications (worked by experts for 10+ yrs)
Awkward development, analysis and debug environments
Parallel programming is hard!
Amdahl’s law is still a law…
SW productivity is already in a crisis; this worsens things!
Single-threaded Applications
Most of today’s applications
Well understood optimization techniques
Advanced development, analysis and debug tools
Conceptually, easy to think about
Establishing an appropriate balance is key for managing this important transition
Other Architectural Directions: Integration
Not only does integrating more system components (e.g. memory controllers, IO) improve performance …
Bose, HotChips 17
Typical Server Power Breakdown
… integration reduces power significantly as well:
• IO communication overhead drops
• CPU-integrated power management can dynamically optimize
• Power efficiency of special-function components (e.g. graphics accelerators, network processors) greatly exceeds that of general-purpose CPUs
[Diagrams: legacy system with I/O hubs, USB, PCI, PCIe™ bridges, and external XMB memory buffers vs. integrated system with per-socket SRQ, crossbar, HyperTransport links (8 GB/s), and on-die memory controllers]
System-level Power Consumption
Dual-core packages with legacy technology:
• 692 watts for processors (173 W each)
• 48 watts for external memory controller
→ 95% more power than:
Dual-core AMD Opteron™ processors:
• 380 watts for processors (95 W each)
• Integrated memory controllers
Source: Mixture of publicly available data sheets and AMD internal estimates. Actual system power measurements may vary based on configuration and components used
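The arithmetic behind the comparison checks out, using only the wattages the slide gives (four dual-core packages at 173 W plus a 48 W external memory controller, vs. four Opterons at 95 W with integrated controllers):

```python
# System-level power comparison from the slide.
legacy = 4 * 173 + 48   # processors + external memory controller
opteron = 4 * 95        # memory controller is integrated on-die

print(legacy, opteron)                           # prints 740 380
print(f"{legacy / opteron - 1:.0%} more power")  # prints 95% more power
```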
[Annotated diagram: the legacy system totals 740 watts (692 watts for processors; a 14-watt memory controller hub plus four XMBs at 8.5 watts each make up the 48 watts of external memory control); the Opteron system totals 380 watts]
Other Architectural Directions: Integration
Barriers?
• Integration of heterogeneous designs is non-trivial
• IP barriers
• Schedule issues with multiple converging components
[Diagram: heterogeneous SoC with a big CPU, several small CPUs, special accelerators, memory, IO, and RF]
Integrating dual designs for the processor core enables both peak performance and throughput/Watt
[Scatter plot: Watts/(Spec·Vdd²·L) vs. Spec2000·L, log-log scales]
Summary (1 of 2)
Silicon process technology is unlikely to be the major engine of processor performance increases in the future.
Major circuit-related challenges that we've only just started to address lie ahead:
• Design for variation tolerance and mitigation
• Maintaining dynamic voltage headroom within reliability- and variation-imposed limits
• Adaptive, self-healing techniques are a key direction
[Diagram: designers squeezed between variation, leakage, Vdd droop, and hot spots]
Summary (2 of 2)
• CPU architectures are converging on modest pipe length, limited-issue, out-of-order designs
• Multi-core is good, but has limits in the not-too-distant future
• Heterogeneous integration is a key direction
Silicon process technology is unlikely to be the major engine of processor performance increases in the future
We’re up to the challenge, but it will be a joint effort …
Moore's Law + Design Community