24

IntelPlatform Capabilities 1X 2X 3X Source: Intel ®® Paxville DP Woodcrest Irwindale Dempsey MV H2 H2 ‘‘06 H1 ‘06 H2 ‘05 H1 ‘05 Platform – Level

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

IntelIntel®® CoreCoreTMTM

MicroarchitectureMicroarchitectureSteve PawlowskiSteve Pawlowski

Intel Senior FellowIntel Senior FellowCTO, GM Architecture & PlanningCTO, GM Architecture & Planning

Digital Enterprise GroupDigital Enterprise Group

Ofri WechslerOfri WechslerIntel FellowIntel Fellow

Director, CPU ArchitectureDirector, CPU ArchitectureMobility GroupMobility Group

Continuing From Last FallContinuing From Last Fall’’s IDFs IDF……

LetLet’’s Take A Look Insides Take A Look Inside

IntelIntel®® CoreCore™™ MicroarchitectureMicroarchitectureThe Energy Efficient Performance LeaderThe Energy Efficient Performance Leader

CoreCore™™ Microarchitecture:Microarchitecture:Higher Performance ANDHigher Performance AND

Lower PowerLower Power

PerformancePerformance EnergyEnergyConsumptionConsumption

PlatformPlatformCapabilitiesCapabilities

1X1X

2X2X

3X3X

Source: IntelSource: Intel®®

Paxville DPPaxville DP

WoodcrestWoodcrest

IrwindaleIrwindale

Dempsey MVDempsey MV

H2 H2 ‘‘0606

H1 H1 ‘‘0606

H2 H2 ‘‘0505

H1 H1 ‘‘0505

Platform Platform –– LevelLevelEnergy EfficientEnergy Efficient

PerformancePerformance

DP Performance Per WattDP Performance Per WattComparison with SPECint_rateComparison with SPECint_rate

at the Platform Levelat the Platform Level

Energy Energy

Perf Perf

5

HistoryHistory

Branch PredictionBranch Prediction

SuperscalarSuperscalar

Out Of OrderOut Of Order

Register RenamingRegister Renaming

SpeculativeSpeculativeExecutionExecution

SSESSE

MicroarchitectureMicroarchitectureandand

ArchitectureArchitecture

NetBurstNetBurstTMTM

SSE 2SSE 2

XEONXEONTMTM

20022002

PENTIUMPENTIUM®® 4420002000

PENTIUMPENTIUM®® IIIIII19991999

PENTIUMPENTIUM®® ProPro19951995

PENTIUMPENTIUM®®

19931993

Hyper Hyper ThreadingThreading

PENTIUMPENTIUM®® MM20032003

PowerPowerManagementManagement

Micro FusionMicro Fusion

ITANIUMITANIUM®®

20012001

EPICEPIC

*Graphics not representative of actual die photo or relative siz*Graphics not representative of actual die photo or relative sizee

IntegratedIntegratedFPUFPU

PipeliningPipelining

i486i486TMTM

19891989

XEONXEON®®

20042004

EM64TEM64T

PERFORMANCE

PERFORMANCE

POWER EFFICIENT

POWER EFFICIENT

PERFORMANCE

PERFORMANCE

Historical Driving ForcesHistorical Driving Forces

0.01

0.1

1

10

1970 1980 1990 2000 2010 20200.01

0.1

1

10

1970 1980 1990 2000 2010 20201

10

100

1000

10000

100000

1970 1980 1990 2000 2010 20201

10

100

1000

10000

100000

1970 1980 1990 2000 2010 2020

Increased PerformanceIncreased Performancevia Increased Frequencyvia Increased Frequency

FrequencyFrequency(MHz)(MHz)

FeatureFeatureSizeSize(um)(um)

2005200565nm65nm

1B+ Transistors1B+ Transistors

1946194620 Numbers20 Numbers

in Main Memoryin Main Memory

19711971I4004 ProcessorI4004 Processor2300 Transistors2300 Transistors

Shrinking GeometryShrinking Geometry

The ChallengesThe Challenges

10

100

1000

1990 1995 2000 2005 2010 2015

CPUCPUPowerPower(W)(W)

30nm

45nm65nm

90nm

0.13um

0.18um0.25um

0.35um

0.5um

0.7um

0.1

1

10

1990 1993 1997 2001 2005 2009

~30%

30nm

45nm65nm

90nm

0.13um

0.18um0.25um

0.35um

0.5um

0.7um

0.1

1

10

1990 1993 1997 2001 2005 2009

~30%SupplySupplyVoltage Voltage

(V)(V)

Power = Capacitance x VoltagePower = Capacitance x Voltage22 x Frequencyx Frequencyalsoalso

Power ~ VoltagePower ~ Voltage33

Power LimitationsPower Limitations Diminishing Voltage ScalingDiminishing Voltage Scaling

17,066 Flops/Watt17,066 Flops/Watt467 Flops/Dollar467 Flops/Dollar

Energy Efficient Performance Energy Efficient Performance –– High EndHigh End

ASC PurpleASC Purple6 MWatt6 MWatt100 TFlops goal100 TFlops goal12K+ cpus 12K+ cpus –– Power5Power5

$230M$230M

DATACENTERDATACENTER““ENERGY LABELENERGY LABEL””

Source: LLNLSource: LLNL

Source: NASASource: NASA

NASA ColumbiaNASA Columbia2 MWatt2 MWatt60 TFlops goal60 TFlops goal10,240 cpus 10,240 cpus –– Itanium IIItanium II

$50M$50M

30,720 Flops/Watt30,720 Flops/Watt1,288 Flops/Dollar1,288 Flops/Dollar

ComputationalComputationalEfficiencyEfficiency

A New EraA New Era……

PerformanceEquals Frequency

Unconstrained Power

Voltage Scaling

PerformanceEquals IPC

THE OLDTHE OLD

THE NEWTHE NEW

Multi-Core

MicroarchitectureAdvancements

Power Efficiency

IntelIntel®® CoreCore™™ MicroarchitectureMicroarchitecture

*Graphics not representative of actual die photo or relative siz*Graphics not representative of actual die photo or relative sizee

ScalableScalableLow PowerLow Power High PerformanceHigh Performance

MeromMerom

ConroeConroe

WoodcrestWoodcrest

65nm65nm

Server Server OptimizedOptimized

Desktop Desktop OptimizedOptimized

Mobile Mobile OptimizedOptimized

IntelIntel®® Wide Wide Dynamic Dynamic ExecutionExecution

IntelIntel®®Intelligent Intelligent

Power Power CapabilityCapability

IntelIntel®®Advanced Advanced

Smart CacheSmart Cache

IntelIntel®® Smart Smart Memory Memory AccessAccess

IntelIntel®®Advanced Advanced

Digital Media Digital Media BoostBoost

IntelIntel®® Wide Dynamic ExecutionWide Dynamic Execution

INSTRUCTION FETCHINSTRUCTION FETCHAND PREAND PRE--DECODEDECODE

INSTRUCTION QUEUEINSTRUCTION QUEUE

RETIREMENT UNITRETIREMENT UNIT(REORDER BUFFER)(REORDER BUFFER)

DECODEDECODE

RENAME / ALLOCRENAME / ALLOC

SCHEDULERSSCHEDULERS

EXECUTEEXECUTE

INSTRUCTION FETCHINSTRUCTION FETCHAND PREAND PRE--DECODEDECODE

INSTRUCTION QUEUEINSTRUCTION QUEUE

RETIREMENT UNITRETIREMENT UNIT(REORDER BUFFER)(REORDER BUFFER)

DECODEDECODE

RENAME / ALLOCRENAME / ALLOC

SCHEDULERSSCHEDULERS

EXECUTEEXECUTE

CORE 1CORE 1 CORE 2CORE 2

4 WIDE 4 WIDE --DECODE TODECODE TO

EXECUTEEXECUTE

4 WIDE 4 WIDE --MICROMICRO--OPOPEXECUTEEXECUTE

MICROMICROandand

MACROMACROFUSIONFUSION

DEEPERDEEPERBUFFERSBUFFERS

EFFICIENTEFFICIENT14 STAGE14 STAGEPIPELINEPIPELINE

ENHANCEDENHANCEDALUsALUs

EACH COREEACH CORE

ADVANTAGEADVANTAGE•• 33% Wider Execution over Previous Gen33% Wider Execution over Previous Gen•• Comprehensive AdvancementsComprehensive Advancements•• Enabled In Each CoreEnabled In Each CoreEnergy Energy

Perf Perf

IntelIntel®® Wide Dynamic ExecutionWide Dynamic ExecutionMicro and Macro FusionMicro and Macro Fusion

DECODEDECODE

EXECUTEEXECUTE

uCODEuCODEROMROM

ADVANTAGEADVANTAGE •• Instruction Load Reduced ~ 15%Instruction Load Reduced ~ 15%****

•• MicroMicro--Ops Reduced ~ 10%Ops Reduced ~ 10%****

WITHOUT MACRO FUSIONWITHOUT MACRO FUSIONWITH MACRO FUSIONWITH MACRO FUSION

INSTRUCTION 3INSTRUCTION 3

INSTRUCTION 2INSTRUCTION 2

INSTRUCTION 1INSTRUCTION 1

INTERNAL INST 1INTERNAL INST 1

COMPLETED INST 3COMPLETED INST 3

COMPLETED INST 2COMPLETED INST 2

COMPLETED INST 1COMPLETED INST 1

DECODEDECODE

EXECUTEEXECUTE

INSTRUCTION 3INSTRUCTION 3

INSTRUCTION 2INSTRUCTION 2

INSTRUCTION 1INSTRUCTION 1

INTERNAL INST 3INTERNAL INST 3

INTERNAL INST 2INTERNAL INST 2

INTERNAL INST 1INTERNAL INST 1

COMPLETED INST 3COMPLETED INST 3

COMPLETED INST 2COMPLETED INST 2

COMPLETED INST 1COMPLETED INST 1

DECODEDECODE

EXECUTEEXECUTE

COMBINED INST 2 & 3COMBINED INST 2 & 3

MACRO FUSION EXAMPLEMACRO FUSION EXAMPLECMP+JMP IN 1 CLOCKCMP+JMP IN 1 CLOCK

MicroMicroFusionFusion

MacroMacroFusionFusion

*Graphics not representative of actual die photo or relative siz*Graphics not representative of actual die photo or relative sizee** Workload dependant** Workload dependant

Energy Energy

Perf Perf

IntelIntel®® Intelligent Power CapabilityIntelligent Power Capability

UltraUltraFineFine

GrainedGrainedCoarseCoarseGrainedGrained

•• AggressiveAggressiveClock GatingClock Gating•• EnhancedEnhancedSpeedSpeed--StepStep

•• Low VCC ArraysLow VCC Arrays•• Blocks ControlledBlocks Controlled

Via SleepVia SleepTransistorsTransistors

•• Low LeakageLow LeakageTransistorsTransistors

•• SleepSleepTransistorsTransistors

TransistorTransistor

•• 65nm65nm•• Strained SiliconStrained Silicon•• LowLow--K DielectricK Dielectric

•• More Metal LayersMore Metal Layers

ProcessProcess

ADVANTAGEADVANTAGE •• MobileMobile--Level Power ManagementLevel Power Management•• Energy Efficient PerformanceEnergy Efficient Performance

*Graphics not representative of actual die photo or relative siz*Graphics not representative of actual die photo or relative sizee

Energy Energy

IntelIntel®® Advanced Smart CacheAdvanced Smart CacheDynamic L2 Cache UsageDynamic L2 Cache Usage

ADVANTAGEADVANTAGE•• Higher Cache Hit RateHigher Cache Hit Rate•• Reduced BUS TrafficReduced BUS Traffic•• Lower Latency to DataLower Latency to Data

CoreCore™™ MicroarchitectureMicroarchitectureShared L2Shared L2

Independent L2Independent L2

Dynamically,Dynamically,BiBi--DirectionallyDirectionally

AvailableAvailable

CORE 1CORE 1 CORE 2CORE 2

L1L1CACHECACHE

L1L1CACHECACHE

CORE 1CORE 1 CORE 2CORE 2

L1L1CACHECACHE

NotNotShareableShareable

*Graphics not representative of actual die photo or relative siz*Graphics not representative of actual die photo or relative sizee

L1L1CACHECACHE

x

Energy Energy

Perf Perf

DecreasedDecreasedTrafficTraffic

IncreasedIncreasedTrafficTraffic

HARDWAREHARDWAREMem. Dis.Mem. Dis.PredictorPredictor

IntelIntel®® Smart Memory AccessSmart Memory AccessHardwareHardware--based Memory Disambiguationbased Memory Disambiguation

INST 2 INST 2 ““LOAD [Y]LOAD [Y]””

INST 1 INST 1 ““STORE [X]STORE [X]””

INST 2 INST 2 ““LOAD [Y]LOAD [Y]””

INST 1 INST 1 ““STORE [X]STORE [X]””

INST 1 INST 1 ““STORE [X]STORE [X]””

DECODE/SCHEDULEDECODE/SCHEDULE

INST 2 INST 2 ““LOAD [Y]LOAD [Y]””

OUTOUTOFOF

ORDERORDER

ININORDERORDER

EXECUTEEXECUTE

INST 2 INST 2 ““LOAD [Y]LOAD [Y]””

INST 1 INST 1 ““STORE [X]STORE [X]””

INST 2 INST 2 ““LOAD [Y]LOAD [Y]””

INST 1 INST 1 ““STORE [X]STORE [X]””

INST 1 INST 1 ““STORE [X]STORE [X]””

DECODE/SCHEDULEDECODE/SCHEDULE

INST 2 INST 2 ““LOAD [Y]LOAD [Y]””

EXECUTEEXECUTE STALLSTALL

ADVANTAGEADVANTAGE•• Higher Utilization of PipelineHigher Utilization of Pipeline•• Masks latency to data accessMasks latency to data access•• Higher PerformanceHigher Performance

CoreCore™™ MicroarchitectureMicroarchitecture OtherOther

Inst. 2 Inst. 2 ““LoadLoad””Can OccurCan Occur

BeforeBeforeInst. 1 Inst. 1 ““StoreStore””

Energy Energy

Perf Perf

Inst. 2 MustInst. 2 MustWait ForWait For

Inst. 1 Inst. 1 ““StoreStore””To CompleteTo Complete

IntelIntel®® Advanced Digital Media BoostAdvanced Digital Media BoostSingle Cycle SSESingle Cycle SSE

CoreCore™™ μμarch arch

PreviousPrevious

DECODEDECODE

X4X4

Y4Y4

X4opY4X4opY4

SOURCESOURCE

X1opY1X1opY1

DECODEDECODE

In Each Core In Each Core

X3X3

Y3Y3

X3opY3X3opY3

X2X2

Y2Y2

X2opY2X2opY2

X1X1

Y1Y1

X1opY1X1opY1

DESTDEST

SSE/2/3 OPSSE/2/3 OP

X2opY2X2opY2

X3opY3X3opY3X4opY4X4opY4

CLOCKCLOCKCYCLE 1CYCLE 1

CLOCKCLOCKCYCLE 2CYCLE 2

00127127

CLOCKCLOCKCYCLE 1CYCLE 1

SSE OperationSSE Operation(SSE/SSE2/SSE3)(SSE/SSE2/SSE3)

ADVANTAGEADVANTAGE•• Increased PerformanceIncreased Performance•• 128 bit Single Cycle in each core128 bit Single Cycle in each core•• Improved Energy EfficiencyImproved Energy Efficiency

EXECUTEEXECUTEEXECUTEEXECUTE

FusionFusionSupportSupport

SingleSingleCycleCycleSSESSE

*Graphics not representative of actual die photo or relative siz*Graphics not representative of actual die photo or relative sizee

Energy Energy

Perf Perf

Next Generation PlatformsNext Generation Platforms

•• Mobile TDPMobile TDP•• 2 Execution Cores2 Execution Cores•• 2MB & 4MB L2 Cache2MB & 4MB L2 Cache•• Mobile Power Mobile Power

OptimizationsOptimizations•• Mobile Platform *TsMobile Platform *Ts

•• 65W Mainstream TDP 65W Mainstream TDP •• 2 Execution Cores2 Execution Cores•• 2MB & 4MB L2 Cache2MB & 4MB L2 Cache•• Desktop Platform *TsDesktop Platform *Ts

•• 80W Target TDP80W Target TDP•• 40W LV Target TDP40W LV Target TDP•• 2 Execution Cores2 Execution Cores•• 4MB L2 Cache4MB L2 Cache•• Server Platform *TsServer Platform *Ts•• DP ConfigurationsDP Configurations

EnergyEnergy PerformancePerformanceEfficientEfficient

Intel® Core™Microarchitecture

Server Server OptimizedOptimized

Desktop Desktop OptimizedOptimized

Mobile Mobile OptimizedOptimized

NapaNaparefreshrefresh

AverillAverill

Bridge Bridge CreekCreek

BensleyBensley

IntelIntel®® Wide Wide Dynamic Dynamic ExecutionExecution

IntelIntel®®Intelligent Intelligent

Power Power CapabilityCapability

IntelIntel®®Advanced Advanced

Smart CacheSmart Cache

IntelIntel®® Smart Smart Memory Memory AccessAccess

IntelIntel®®Advanced Advanced

Digital Media Digital Media BoostBoost

Comprehensive Platform ArchitectureComprehensive Platform ArchitectureMemory Controller ConsiderationsMemory Controller Considerations

Rapid Adoption OfRapid Adoption OfTechnology TransitionsTechnology Transitions

Market Segments RequireMarket Segments RequireDifferent TechnologiesDifferent Technologies

MOBILE DESKTOP SERVERMOBILE DESKTOP SERVER

COHERENCY LATENCYCOHERENCY LATENCYREDUCTIONREDUCTION

PROCESS SCALING RATESPROCESS SCALING RATES

PLATFORMPLATFORM--LEVELLEVELENHANCEMENTSENHANCEMENTS

MEMORY ACCESS REDUCTIONMEMORY ACCESS REDUCTION

POWER MANAGEMENTPOWER MANAGEMENT

RELIABILITYRELIABILITY

++BusinessBusiness TechnicalTechnical

PERFORMANCEPERFORMANCE

0%

20%

40%

60%

80%

100%

1996 2000 2004 2008 2012

SDRAMSDRAMDDRDDR DDR2DDR2 DDR3DDR3

FBDFBDFPM FPM

EDOEDO

Comprehensive Platform ArchitectureComprehensive Platform ArchitectureMEMORY CONTROLLER CONSIDERATIONSMEMORY CONTROLLER CONSIDERATIONS

COHERENCY LATENCYCOHERENCY LATENCYREDUCTIONREDUCTION

PROCESS SCALING RATESPROCESS SCALING RATES

PLATFORMPLATFORM--LEVELLEVELENHANCEMENTSENHANCEMENTS

MEMORY ACCESS REDUCTIONMEMORY ACCESS REDUCTION

POWER MANAGEMENTPOWER MANAGEMENT

RELIABILITYRELIABILITY

TechnicalTechnical

PERFORMANCEPERFORMANCE

Memory ControllerMemory Controllerin Chipsetin Chipset

Extended RASExtended RASfor Serversfor Servers

Memory Closer toMemory Closer toIntegrated GFX,Integrated GFX,

Smart Cache andSmart Cache andSmart Memory Smart Memory

AccessAccess

CPU Power StatesCPU Power Statesw/ Memory Alivew/ Memory Alive

CentralizedCentralizedCoherency ControlCoherency Control

MC vs Processor MC vs Processor Device ScalingDevice Scaling

IO AccelerationIO AccelerationTechnologyTechnology

DP Server ArchitectureDP Server Architecture

CONSTANTLY ANALYZING THE REQUIREMENTS,CONSTANTLY ANALYZING THE REQUIREMENTS,THE TECHNOLOGIES, AND THE TRADEOFFSTHE TECHNOLOGIES, AND THE TRADEOFFS

*Graphics not representative of actual die photo or relative siz*Graphics not representative of actual die photo or relative sizee

Energy Energy

Perf Perf

AMB AMB AMB AMB

AMB AMB AMB

AMB

AMB AMB AMB

AMB

AMB AMB AMB AMB

Bensley PlatformBensley Platform

BlackfordBlackford

17 GB/s17 GB/s

64 GB64 GB

FSB ScalingFSB Scaling800MHz800MHz1067MHz1067MHz1333MHz1333MHz

Point to PointPoint to PointInterconnectInterconnect

Local and RemoteLocal and RemoteMemory LatenciesMemory Latencies

ConsistentConsistent

Central CoherencyCentral CoherencyResolutionResolution

Sustained &Sustained &BalancedBalanced

ThroughputThroughput

Easy CapacityEasy CapacityExpansionExpansion

LargeLargeShared CachesShared Caches

PlatformPlatformPerformance:Performance:ItIt’’s all abouts all about

BandwidthBandwidth &&LatencyLatency

CoreCore™™ Microarchitecture Advances WithMicroarchitecture Advances WithQuad CoreQuad Core

Quad CoreQuad Core

KentsfieldKentsfield

ClovertownClovertown

ServerServer

DesktopDesktop

Paxville DPPaxville DP

WoodcrestWoodcrest

IrwindaleIrwindale

DP Performance Per WattDP Performance Per WattComparison with SPECint_rateComparison with SPECint_rate

at the Platform Levelat the Platform Level

Dempsey MVDempsey MV

1X1X

2X2X

3X3X H2 H2 ‘‘0606

H1 H1 ‘‘0606

H2 H2 ‘‘0505

H1 H1 ‘‘0505

Source: IntelSource: Intel®®

*Graphics not representative of actual die photo or relative siz*Graphics not representative of actual die photo or relative sizee

Clovertown Clovertown H1 H1 ‘‘0707

EnergyEnergy PerformancePerformanceEfficientEfficient

Developing for theDeveloping for theIntelIntel®® CoreCore™™ MicroarchitectureMicroarchitecture

Program ForMulticore

Utilize SSEEnhancements

Benefit FromPower Efficiency

Exploit Energy Efficient PerformanceExploit Energy Efficient Performance

Further Insight IntoFurther Insight IntoIntelIntel®® CoreCoreTMTM MicroarchitectureMicroarchitecture

CoreCoreTMTM Microarchitecture White Paper at rear of auditoriumMicroarchitecture White Paper at rear of auditorium

IntelIntel®®'s Core's CoreTMTM Microarchitecture Microarchitecture

(Session, MATS001)(Session, MATS001)

IntelIntel®® MultiMulti--Core Architecture and Implementations Core Architecture and Implementations

(Session, MATS002)(Session, MATS002)

MultiMulti--Core and CoreCore and CoreTMTM Microarchitecture Microarchitecture

(Chalk Talk, MATC005)(Chalk Talk, MATC005)

Shop Talk with IntelShop Talk with Intel®® Fellows tomorrow morning 8Fellows tomorrow morning 8--99

Technology ShowcaseTechnology Showcase

Check out www.intel.com/multiCheck out www.intel.com/multi--corecore