Intel® Core™ Microarchitecture
Steve Pawlowski
Intel Senior Fellow
CTO, GM Architecture & Planning
Digital Enterprise Group
Ofri Wechsler
Intel Fellow
Director, CPU Architecture
Mobility Group
Continuing From Last Fall's IDF…
Let's Take A Look Inside
Intel® Core™ Microarchitecture
The Energy Efficient Performance Leader
Core™ Microarchitecture: Higher Performance AND Lower Power
• Performance
• Energy Consumption
• Platform Capabilities
Platform-Level Energy Efficient Performance
[Chart: DP Performance Per Watt, comparison with SPECint_rate at the platform level. Vertical axis: 1X, 2X, 3X. Time axis: H1 '05, H2 '05, H1 '06, H2 '06. Platforms shown: Irwindale, Paxville DP, Dempsey MV, Woodcrest. Source: Intel®]
History: Microarchitecture and Architecture
Processors:
• i486™ (1989)
• PENTIUM® (1993)
• PENTIUM® Pro (1995)
• PENTIUM® III (1999)
• PENTIUM® 4 (2000)
• ITANIUM® (2001)
• XEON™ (2002)
• PENTIUM® M (2003)
• XEON® (2004)
Features introduced along the way: Pipelining, Integrated FPU, Branch Prediction, Superscalar, Out Of Order, Register Renaming, Speculative Execution, SSE, NetBurst™, SSE 2, Hyper Threading, EPIC, Power Management, Micro Fusion, EM64T
Design emphasis labels: PERFORMANCE, POWER EFFICIENT
*Graphics not representative of actual die photo or relative size
Historical Driving Forces
Increased Performance via Increased Frequency
[Chart: Frequency (MHz), log scale from 1 to 100000, 1970 to 2020]
Shrinking Geometry
[Chart: Feature Size (um), log scale from 0.01 to 10, 1970 to 2020]
• 1946: 20 Numbers in Main Memory
• 1971: I4004 Processor, 2300 Transistors
• 2005: 65nm, 1B+ Transistors
The Challenges
Power Limitations
[Chart: CPU Power (W), log scale from 10 to 1000, 1990 to 2015]
Diminishing Voltage Scaling
[Chart: Supply Voltage (V), log scale from 0.1 to 10, 1990 to 2009, across process generations 0.7um, 0.5um, 0.35um, 0.25um, 0.18um, 0.13um, 90nm, 65nm, 45nm, 30nm; annotated ~30%]
Power = Capacitance x Voltage² x Frequency
also
Power ~ Voltage³
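The two power relations on this slide can be tied together with a short numeric sketch. It assumes, as the slide does not state explicitly, that achievable frequency scales roughly in proportion to supply voltage, which is what turns the Voltage² x Frequency relation into the cubic one. The capacitance, voltage and frequency values are illustrative placeholders, not figures from the presentation.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Power = Capacitance x Voltage^2 x Frequency, as written on the slide."""
    return capacitance * voltage ** 2 * frequency


# Illustrative values only (assumptions, not taken from the slide).
C = 1.0e-9  # farads
V = 1.2     # volts
F = 3.0e9   # hertz

baseline = dynamic_power(C, V, F)


def scaled_power_ratio(k):
    """Power relative to baseline when voltage AND frequency are both scaled by k.

    Assumes frequency tracks voltage linearly, so power scales as k^3.
    """
    return dynamic_power(C, k * V, k * F) / baseline


if __name__ == "__main__":
    for k in (1.0, 0.9, 0.8):
        print(f"scale {k:.1f}: relative power {scaled_power_ratio(k):.3f}")
```

Under that assumption, running 20% slower at 20% lower voltage costs roughly half the power, which is the trade the rest of the deck builds on.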
Energy Efficient Performance: High End
DATACENTER "ENERGY LABEL": Computational Efficiency
ASC Purple (Source: LLNL)
• 6 MWatt
• 100 TFlops goal
• 12K+ cpus (Power5)
• $230M
• 17,066 Flops/Watt
• 467 Flops/Dollar
NASA Columbia (Source: NASA)
• 2 MWatt
• 60 TFlops goal
• 10,240 cpus (Itanium II)
• $50M
• 30,720 Flops/Watt
• 1,288 Flops/Dollar
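The per-watt figures on this slide can be reproduced with simple arithmetic. The sketch below assumes a unit convention the slide does not spell out: the TFlops goal is converted at 1 TFlops = 1024 GFlops and divided by the power in MWatt, so the result is GFlops per MWatt (equivalently, thousands of flops per watt). The function name is hypothetical, and the Flops/Dollar figures are not modeled here.

```python
def gflops_per_megawatt(tflops_goal, megawatts, gflops_per_tflop=1024):
    """Computational efficiency as GFlops per MWatt.

    The 1024 conversion factor is an assumption chosen because it
    reproduces the slide's printed values.
    """
    return tflops_goal * gflops_per_tflop / megawatts


# (TFlops goal, MWatt) as listed on the slide.
systems = {
    "ASC Purple": (100, 6),
    "NASA Columbia": (60, 2),
}

# Truncate to whole numbers, matching how the slide prints them.
efficiency = {
    name: int(gflops_per_megawatt(tflops, mw))
    for name, (tflops, mw) in systems.items()
}

if __name__ == "__main__":
    for name, value in efficiency.items():
        print(f"{name}: {value:,}")
```

On these inputs NASA Columbia comes out 1.8 times more efficient per watt than ASC Purple.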
A New Era…
THE OLD
• Performance Equals Frequency
• Unconstrained Power
• Voltage Scaling
THE NEW
• Performance Equals IPC
• Multi-Core
• Microarchitecture Advancements
• Power Efficiency
Intel® Core™ Microarchitecture
Scalable, Low Power, High Performance
65nm
• Merom: Mobile Optimized
• Conroe: Desktop Optimized
• Woodcrest: Server Optimized
Key technologies:
• Intel® Wide Dynamic Execution
• Intel® Intelligent Power Capability
• Intel® Advanced Smart Cache
• Intel® Smart Memory Access
• Intel® Advanced Digital Media Boost
Intel® Wide Dynamic Execution
[Diagram: CORE 1 and CORE 2, each with the following blocks]
• INSTRUCTION FETCH AND PRE-DECODE
• INSTRUCTION QUEUE
• RETIREMENT UNIT (REORDER BUFFER)
• DECODE
• RENAME / ALLOC
• SCHEDULERS
• EXECUTE
EACH CORE
• 4 WIDE - DECODE TO EXECUTE
• 4 WIDE - MICRO-OP EXECUTE
• MICRO and MACRO FUSION
• DEEPER BUFFERS
• EFFICIENT 14 STAGE PIPELINE
• ENHANCED ALUs
ADVANTAGE
• 33% Wider Execution over Previous Gen
• Comprehensive Advancements
• Enabled In Each Core
Intel® Wide Dynamic Execution: Micro and Macro Fusion
[Diagram: DECODE and EXECUTE stages with uCODE ROM; Micro Fusion and Macro Fusion highlighted]
WITHOUT MACRO FUSION
• In: INSTRUCTION 1, INSTRUCTION 2, INSTRUCTION 3
• After DECODE: INTERNAL INST 1, INTERNAL INST 2, INTERNAL INST 3
• After EXECUTE: COMPLETED INST 1, COMPLETED INST 2, COMPLETED INST 3
WITH MACRO FUSION
• In: INSTRUCTION 1, INSTRUCTION 2, INSTRUCTION 3
• After DECODE: INTERNAL INST 1, COMBINED INST 2 & 3
• After EXECUTE: COMPLETED INST 1, COMPLETED INST 2, COMPLETED INST 3
MACRO FUSION EXAMPLE: CMP+JMP IN 1 CLOCK
ADVANTAGE
• Instruction Load Reduced ~15%**
• Micro-Ops Reduced ~10%**
** Workload dependent
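The macro fusion example above (a compare and a jump handled as one internal operation) can be illustrated with a toy decoder. This is a minimal sketch, not a description of the real decode hardware: the instruction strings, the set of jump mnemonics and the function name are all assumptions made for illustration.

```python
# Illustrative subset of conditional jump mnemonics (an assumption, not an
# exhaustive or authoritative list of fusible instructions).
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def decode(instructions, macro_fusion=True):
    """Turn a list of instruction strings into a list of internal operations.

    With macro_fusion enabled, a 'cmp' immediately followed by a conditional
    jump is emitted as ONE combined internal operation.
    """
    ops = []
    i = 0
    while i < len(instructions):
        mnemonic = instructions[i].split()[0]
        next_mnemonic = (
            instructions[i + 1].split()[0] if i + 1 < len(instructions) else None
        )
        if macro_fusion and mnemonic == "cmp" and next_mnemonic in CONDITIONAL_JUMPS:
            ops.append(instructions[i] + " + " + instructions[i + 1])
            i += 2  # both instructions consumed by one internal op
        else:
            ops.append(instructions[i])
            i += 1
    return ops


program = ["add eax, 1", "cmp eax, ebx", "jne loop"]

if __name__ == "__main__":
    print(decode(program, macro_fusion=False))  # three internal ops
    print(decode(program, macro_fusion=True))   # two internal ops
```

As on the slide, three incoming instructions become three internal operations without fusion and two with it, while all three instructions still complete.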
Intel® Intelligent Power Capability
Ultra Fine Grained to Coarse Grained
• Aggressive Clock Gating
• Enhanced Speed-Step
• Low VCC Arrays
• Blocks Controlled Via Sleep Transistors
Transistor
• Low Leakage Transistors
• Sleep Transistors
Process
• 65nm
• Strained Silicon
• Low-K Dielectric
• More Metal Layers
ADVANTAGE
• Mobile-Level Power Management
• Energy Efficient Performance
Intel® Advanced Smart Cache: Dynamic L2 Cache Usage
Core™ Microarchitecture Shared L2
• CORE 1 and CORE 2, each with its own L1 CACHE
• L2 Dynamically, Bi-Directionally Available
• Decreased Traffic
Independent L2
• CORE 1 and CORE 2, each with its own L1 CACHE
• L2 Not Shareable
• Increased Traffic
ADVANTAGE
• Higher Cache Hit Rate
• Reduced Bus Traffic
• Lower Latency to Data
Intel® Smart Memory Access: Hardware-based Memory Disambiguation
Core™ Microarchitecture (HARDWARE Mem. Dis. Predictor)
• DECODE/SCHEDULE: INST 1 "STORE [X]", INST 2 "LOAD [Y]"
• EXECUTE: OUT OF ORDER
• Inst. 2 "Load" Can Occur Before Inst. 1 "Store"
Other
• DECODE/SCHEDULE: INST 1 "STORE [X]", INST 2 "LOAD [Y]"
• EXECUTE: IN ORDER, STALL
• Inst. 2 Must Wait For Inst. 1 "Store" To Complete
ADVANTAGE
• Higher Utilization of Pipeline
• Masks latency to data access
• Higher Performance
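The contrast between the two cases (the load waiting for the store versus the load going first) can be sketched as a toy model. The slide only states that the load can occur before the store when a hardware predictor allows it; the recovery step below, where a wrongly hoisted load is re-executed after the store, is an assumption about how such speculation is typically repaired, and the function name and addresses are hypothetical.

```python
def run_pair(store_addr, load_addr, predict_independent):
    """Toy model: inst 1 is a store, inst 2 is a later load.

    Returns (issue_order, replayed), where issue_order lists the operations
    in the order they execute and replayed says whether the load had to be
    executed a second time.
    """
    if not predict_independent:
        # Conservative case: the load waits for the store to complete.
        return ["STORE", "LOAD"], False

    # Speculative case: the load issues before the store's address is known.
    order = ["LOAD", "STORE"]
    conflict = store_addr == load_addr
    if conflict:
        # Misprediction (assumed recovery): the early load read stale data,
        # so it is executed again after the store.
        order.append("LOAD")
    return order, conflict


if __name__ == "__main__":
    print(run_pair(0x100, 0x200, predict_independent=True))   # load goes first
    print(run_pair(0x100, 0x200, predict_independent=False))  # load waits
    print(run_pair(0x100, 0x100, predict_independent=True))   # load replayed
```

The benefit claimed on the slide corresponds to the first case: when X and Y really are different locations, the load no longer sits behind the store.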
Intel® Advanced Digital Media Boost: Single Cycle SSE
SSE Operation (SSE/SSE2/SSE3), 128-bit operands (bits 127 to 0)
• SOURCE: X4, X3, X2, X1
• SSE/2/3 OP
• SOURCE: Y4, Y3, Y2, Y1
• DEST: X4opY4, X3opY3, X2opY2, X1opY1
Core™ μarch
• DECODE, EXECUTE: X4opY4, X3opY3, X2opY2 and X1opY1 all in CLOCK CYCLE 1
Previous
• DECODE, EXECUTE: the four results are spread across CLOCK CYCLE 1 and CLOCK CYCLE 2
In Each Core
• Single Cycle SSE
• Fusion Support
ADVANTAGE
• Increased Performance
• 128 bit Single Cycle in each core
• Improved Energy Efficiency
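The one-cycle versus two-cycle difference on this slide comes down to how wide the execution datapath is relative to the 128-bit operand. The sketch below assumes the previous design handled a 128-bit SSE operation on a 64-bit datapath; the slide shows two clock cycles but does not state that width, and the function name is hypothetical.

```python
import math


def cycles_per_sse_op(datapath_bits, operand_bits=128):
    """Passes through the execution unit needed to cover one SSE operand."""
    return math.ceil(operand_bits / datapath_bits)


def total_cycles(num_ops, datapath_bits):
    """Execute cycles for a stream of back-to-back 128-bit SSE operations.

    Ignores pipelining, decode and memory effects; this is only the
    throughput arithmetic implied by the slide.
    """
    return num_ops * cycles_per_sse_op(datapath_bits)


if __name__ == "__main__":
    print(cycles_per_sse_op(64))    # assumed previous-generation width
    print(cycles_per_sse_op(128))   # full-width datapath
    print(total_cycles(1000, 64), total_cycles(1000, 128))
```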
Next Generation Platforms
Energy Efficient Performance
Intel® Core™ Microarchitecture
Mobile Optimized
• Mobile TDP
• 2 Execution Cores
• 2MB & 4MB L2 Cache
• Mobile Power Optimizations
• Mobile Platform *Ts
Desktop Optimized
• 65W Mainstream TDP
• 2 Execution Cores
• 2MB & 4MB L2 Cache
• Desktop Platform *Ts
Server Optimized
• 80W Target TDP
• 40W LV Target TDP
• 2 Execution Cores
• 4MB L2 Cache
• Server Platform *Ts
• DP Configurations
Platforms: Napa refresh, Averill, Bridge Creek, Bensley
Key technologies:
• Intel® Wide Dynamic Execution
• Intel® Intelligent Power Capability
• Intel® Advanced Smart Cache
• Intel® Smart Memory Access
• Intel® Advanced Digital Media Boost
Comprehensive Platform Architecture: Memory Controller Considerations
Business
• Rapid Adoption Of Technology Transitions
• Market Segments Require Different Technologies (MOBILE, DESKTOP, SERVER)
Technical
• COHERENCY LATENCY REDUCTION
• PROCESS SCALING RATES
• PLATFORM-LEVEL ENHANCEMENTS
• MEMORY ACCESS REDUCTION
• POWER MANAGEMENT
• RELIABILITY
• PERFORMANCE
[Chart: memory technology share, 0% to 100%, 1996 to 2012: FPM, EDO, SDRAM, DDR, DDR2, DDR3, FBD]
Comprehensive Platform Architecture: MEMORY CONTROLLER CONSIDERATIONS
Technical: COHERENCY LATENCY REDUCTION, PROCESS SCALING RATES, PLATFORM-LEVEL ENHANCEMENTS, MEMORY ACCESS REDUCTION, POWER MANAGEMENT, RELIABILITY, PERFORMANCE
Memory Controller in Chipset
• Extended RAS for Servers
• Memory Closer to Integrated GFX
• Smart Cache and Smart Memory Access
• CPU Power States w/ Memory Alive
• Centralized Coherency Control
• MC vs Processor Device Scaling
• IO Acceleration Technology
• DP Server Architecture
CONSTANTLY ANALYZING THE REQUIREMENTS, THE TECHNOLOGIES, AND THE TRADEOFFS
Bensley Platform
[Diagram: Blackford chipset with 16 AMB blocks; 17 GB/s; 64 GB]
• FSB Scaling: 800MHz, 1067MHz, 1333MHz
• Point to Point Interconnect
• Local and Remote Memory Latencies Consistent
• Central Coherency Resolution
• Sustained & Balanced Throughput
• Easy Capacity Expansion
• Large Shared Caches
Platform Performance: It's all about Bandwidth & Latency
Core™ Microarchitecture Advances With Quad Core
Quad Core
• Kentsfield: Desktop
• Clovertown: Server
Energy Efficient Performance
[Chart: DP Performance Per Watt, comparison with SPECint_rate at the platform level. Vertical axis: 1X, 2X, 3X. Time axis: H1 '05, H2 '05, H1 '06, H2 '06, H1 '07. Platforms shown: Irwindale, Paxville DP, Dempsey MV, Woodcrest, Clovertown (H1 '07). Source: Intel®]
Developing for the Intel® Core™ Microarchitecture
• Program For Multicore
• Utilize SSE Enhancements
• Benefit From Power Efficiency
Exploit Energy Efficient Performance
Further Insight Into Intel® Core™ Microarchitecture
• Core™ Microarchitecture White Paper at rear of auditorium
• Intel®'s Core™ Microarchitecture (Session, MATS001)
• Intel® Multi-Core Architecture and Implementations (Session, MATS002)
• Multi-Core and Core™ Microarchitecture (Chalk Talk, MATC005)
• Shop Talk with Intel® Fellows tomorrow morning 8-9
• Technology Showcase
• Check out www.intel.com/multi-core