Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
IntelIntel®® CoreCore™™MicroarchitectureMicroarchitecture
March 8, 2006March 8, 2006
NEW
Stephen L. SmithVice President
Digital Enterprise Group
Bob ValentineArchitect
Intel Architecture Group
2
Agenda
• Multi-core Update and New Microarchitecture Level Set
• New Intel® Core™ Microarchitecture
• Wrap Up
3
Intel Multi-core Roadmap – Updates since Fall IDF
4
Ramping Multi-core Everywhere
All products and dates are preliminary and subject to change without notice.
5
Refresher: What is Multi-Core?Two or more independent execution cores in the same processor
Specific implementations will vary over time - driven by product implementation and manufacturing efficiencies• Best mix of product architecture and volume mfg capabilities
– Architecture: Shared Caches vs. Independent Caches– Mfg capabilities: volume packaging technology
• Designed to deliver performance, OEM and end user experience
MultiMulti--Chip ProcessorChip ProcessorSingle die (Monolithic) based processorSingle die (Monolithic) based processorExample: 90nm PentiumExample: 90nm Pentium®® D D
Processor (Smithfield)Example: Intel CoreExample: Intel Core™™ Duo Duo
Processor (Yonah)Example: 65nm Pentium D Example: 65nm Pentium D
Processor (Presler)Processor (Smithfield) Processor (Yonah) Processor (Presler)
Front Side BusFront Side Bus
Core0Core0 Core1Core1 Core0Core0 Core1Core1
Front Side BusFront Side Bus
Core0Core0 Core1Core1
Front Side BusFront Side Bus
*Not representative of actual die photos or relative size
6
Intel® Core™ Micro-architecture
*Not representative of actual die photo or relative size
7
Intel Multi-core Roadmap
8
Intel Multi-core Roadmap
9
Intel® Core™ Microarchitecture Based PlatformsPlatformPlatform 2006 20072006 2007
MP ServersMP Servers
DP Servers/DP Servers/DP WorkstationDP Workstation
Mobile Client Mobile Client
Desktop Desktop --Home Home
Desktop Desktop --Office Office
Caneland Platform (2007) Tigerton (QC) (2007)
Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) )Woodcrest (Q3’06)
Clovertown (QC) (Q1’07)
Bridge Creek Platform (Mid’06)Conroe (Q3’06)
Kentsfield (QC) (Q1’07)
Kaylo Platform (Q3’06)/ Wyloway Platform (Q3 ’06)Conroe (Q3’06)
Kentsfield (QC) (Q1’07)UP Servers/UP Servers/UP WorkstationUP Workstation
Napa Platform (Q1’06)Merom (2H’06)
Averill Platform (Mid’06) Conroe (Q3’06)
All products and dates are preliminary and subject to change without notice.
Note: only Intel® Core™ microarchitecture based processors listed
QC refers to Quad-Core
10
Intel® Core™ MicroarchitecturePerformance
Delivering both industry leading performance and performance/watt• Conroe: >40% improvement in performance1 & >40%
reduction in power2
– As compared to today’s high-end Pentium® D processor 950 (formerly Presler)
• Woodcrest: >80% improvement in performance1 and > 35% reduction in power2
– As compared to today’s high-end Dual-Core Intel® Xeon® processor 2.8GHz (formerly Paxville DP)
• Merom: Extends the already significant performance and performance/watt leadership delivered with today's Intel® Core™ Duo processor with greater than 20% additional performance1 improvement– As compared to today’s high-end Intel® Core™ Duo processor (formerly Yonah)
1 - Estimated SPECint*_rate_base2000
2 – Expected reduction in TDP
11
Agenda
• Multi-core Update and New Micro-architecture level set
• New Intel® Core™ Microarchitecture
• Summary
12
Inside the IntelInside the Intel®® CoreCore™™MicroarchitectureMicroarchitecture
13
AgendaAgenda
–– MultiMulti--core Update and New Microcore Update and New Micro--architecture level set architecture level set
–– New IntelNew Intel®® CoreCore™™ MicroarchitectureMicroarchitecture–– Intel Microarchitecture HistoryIntel Microarchitecture History
–– IntelIntel®® CoreCore™™ Microarchitecture Design Goals and RoadmapMicroarchitecture Design Goals and Roadmap
–– Processor Architecture 101 Processor Architecture 101
–– IntelIntel®® CoreCore™™ MicroarchitectureMicroarchitecture
–– Software ImplicationsSoftware Implications
–– Wrap UpWrap Up
14
Microarchitecture HistoryMicroarchitecture History
15
New Microarchitecture Coming in 2006New Microarchitecture Coming in 2006
16
AgendaAgenda
–– MultiMulti--core Update and New Microcore Update and New Micro--architecture level set architecture level set
–– New IntelNew Intel®® CoreCore™™ MicroarchitectureMicroarchitecture–– Intel Microarchitecture HistoryIntel Microarchitecture History
–– IntelIntel®® CoreCore™™ Microarchitecture Design GoalsMicroarchitecture Design Goals
–– Processor Architecture 101 Processor Architecture 101
–– IntelIntel®® CoreCore™™ MicroarchitectureMicroarchitecture
–– Software ImplicationsSoftware Implications
–– Wrap UpWrap Up
17
IntelIntel®® CoreCore™™ Microarchitecture: Microarchitecture: Design GoalsDesign Goals
Deliver world class performance combined Deliver world class performance combined with superior energy/power efficiency with superior energy/power efficiency –– Existing and emerging applications and usesExisting and emerging applications and uses–– Greater performance and performance/wattGreater performance and performance/watt–– Optimized for Intel MultiOptimized for Intel Multi--core platformscore platforms
Deliver single foundation for optimized Deliver single foundation for optimized processors across each segment and power processors across each segment and power envelopeenvelope–– Optimized for mobile, desktop and server segmentsOptimized for mobile, desktop and server segments
Driving Performance and Driving Performance and Performance/Watt LeadershipPerformance/Watt Leadership
18
AgendaAgenda
–– MultiMulti--core Update and New Microcore Update and New Micro--architecture level set architecture level set
–– New IntelNew Intel®® CoreCore™™ MicroarchitectureMicroarchitecture–– Intel Microarchitecture HistoryIntel Microarchitecture History
–– IntelIntel®® CoreCore™™ Microarchitecture Design Goals Microarchitecture Design Goals
–– Processor Architecture 101 Processor Architecture 101
–– IntelIntel®® CoreCore™™ MicroarchitectureMicroarchitecture
–– Software ImplicationsSoftware Implications
–– Wrap UpWrap Up
19
Processor Architecture 101Processor Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Delivered Performance = Delivered Performance = Frequency * Instructions Per Cycle (IPC)Frequency * Instructions Per Cycle (IPC)
Goal is higher performance and lower power
Power α Cdynamic * V * V * FrequencyPower Power αα CCdynamicdynamic * V * V * Frequency* V * V * Frequency
Cdynamic is roughly a product of area and activity“how many bits” * “how much do they toggle”
20
Processor Architecture 101Processor Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Delivered Performance = Delivered Performance = Frequency * Instructions Per Cycle (IPC)Frequency * Instructions Per Cycle (IPC)
Frequency is proportional to voltage,Frequency is proportional to voltage,so frequency reduction coupled with so frequency reduction coupled with voltage reduction results in cubic voltage reduction results in cubic reduction in power. reduction in power.
Power α Cdynamic * V * V * FrequencyPower Power αα CCdynamicdynamic * V * V * Frequency* V * V * Frequency
21
Processor Architecture 101Processor Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Delivered Performance = Delivered Performance = Frequency * Instructions Per Cycle (IPC)Frequency * Instructions Per Cycle (IPC)
Higher IPC usuallyHigher IPC usuallyresults in wider data pathsresults in wider data pathsand/or more speculation :and/or more speculation :directly increasing C dynamicdirectly increasing C dynamic
Power α Cdynamic * V * V * FrequencyPower Power αα CCdynamicdynamic * V * V * Frequency* V * V * Frequency
22
AgendaAgenda
–– MultiMulti--core Update and New Microcore Update and New Micro--architecture level set architecture level set
–– New IntelNew Intel®® CoreCore™™ MicroarchitectureMicroarchitecture–– Intel Microarchitecture HistoryIntel Microarchitecture History
–– IntelIntel®® CoreCore™™ Microarchitecture Design Goals Microarchitecture Design Goals
–– Processor Architecture 101 Processor Architecture 101
–– IntelIntel®® CoreCore™™ MicroarchitectureMicroarchitecture
–– Software ImplicationsSoftware Implications
–– Wrap UpWrap Up
23
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
IntelIntel®® CoreCore™™MicroarchitectureMicroarchitecture
Block Diagram Walkthrough
24
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
IntelIntel®® CoreCore™™MicroarchitectureMicroarchitecture
in orderin order
instruction fetchinstruction fetchinstruction decodeinstruction decodemicromicro--op renameop renamemicromicro--op allocateop allocate
25
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
IntelIntel®® CoreCore™™MicroarchitectureMicroarchitecture
out of orderout of order
micromicro--op scheduleop schedulemicromicro--op executeop execute
26
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
IntelIntel®® CoreCore™™MicroarchitectureMicroarchitecture
out of orderout of order
memory pipelinesmemory pipelines
memory order unitmemory order unitmaintains architecturalmaintains architecturalordering requirementsordering requirements
27
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
IntelIntel®® CoreCore™™MicroarchitectureMicroarchitecture
in orderin order
micromicro--op retirementop retirementfault handlingfault handling
Retirement UnitRetirement Unitmaintains illusionmaintains illusionof in orderof in orderinstruction retirementinstruction retirement
28
IntelIntel®® CoreCore™™Microarchitecture
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
Microarchitecture
Wide Dynamic Execution
Advanced Digital Media Boost
Smart Memory Access
Advanced Smart Cache
Intelligent Power Capability
New, StateNew, State--ofof--thethe--Art, Art, MicroarchitectureMicroarchitecture
29
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
Wide Dynamic Wide Dynamic ExecutionExecution
Start with Instruction Fetch
four(+) instructions / cycle
>33% increase over other x86 processors
Instructions converted to micro-ops (uops)
~1 uop per x86 instruction
30
MicroMicro--op Reduction op Reduction (recall Processor (recall Processor 101)101)
Fewer Fewer uopsuops per instructionper instructionallows IPC to be increasedallows IPC to be increasedwhile lowering C dynamicwhile lowering C dynamic(less bits and less toggling)(less bits and less toggling)
Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Delivered Performance = Delivered Performance = Frequency * Instructions Per Cycle (IPC)Frequency * Instructions Per Cycle (IPC)
Power = Cdynamic * V * V * FrequencyPower = Power = CCdynamicdynamic * V * V * Frequency* V * V * Frequency
31
Techniques for MicroTechniques for Micro--op Reductionop Reduction
ESP Tracker (Extended Stack Pointer)ESP Tracker (Extended Stack Pointer)–– Execute Stack Pointer updates in dedicated hardwareExecute Stack Pointer updates in dedicated hardware–– IntelIntel®® CoreCore™™ microarchitecture increases BW by 33%*microarchitecture increases BW by 33%*
MicroMicro--Op MicroOp Micro--FusionFusion–– Single Single UopUop representation of representation of ““multimulti--uopuop”” instruction instruction –– IntelIntel®® CoreCore™™ microarchitecture increase # instructions*microarchitecture increase # instructions*
MacroMacro--FusionFusion–– New technique in IntelNew technique in Intel®® CoreCore™™ microarchitecture (more on microarchitecture (more on
next pages)next pages)
* Techniques pioneered on Intel* Techniques pioneered on Intel®® PentiumPentium®® M processorsM processors
32
New: MacroNew: Macro--FusionFusion
Represent common x86 instruction pairs in Represent common x86 instruction pairs in single microsingle micro--opop––CMP or TEST + Conditional Branch (CMP or TEST + Conditional Branch (JccJcc))
Enhanced Arithmetic Logic Unit (ALU) for Enhanced Arithmetic Logic Unit (ALU) for macromacro--fusionfusion––Single dispatch Single dispatch -- efficiencyefficiency
––Single cycle execution Single cycle execution -- performanceperformance
33
WithoutWithoutMacroMacro--FusionFusion
Instruction Queue
Read four instructions from Instruction Queue
Each instruction gets decoded into separate uops
store [mem3], ebx
load eax, [mem1]cmp eax, [mem2]
jne targ
inc ecx
inc ecx
store [mem3], ebx
dec1 dec2 dec3
jne targ
load eax, [mem1]
cmp eax, [mem2]
dec0
Cycle 1
Cycle 2
34
With IntelWith Intel’’s New s New MacroMacro--FusionFusion
Read five Instructions from Instruction Queue
Send fusable pair to single decoder
Single uop represents two instructions
store [mem3], ebx
load eax, [mem1]cmpjne eax, [mem2], targ
inc ecx
dec1
Instruction Queue
inc ecx
dec2 dec3
load eax, [mem1]
cmp eax, [mem2]
jne targ
store [mem3], ebx
dec0
Cycle 1
35
cmpjne eax, mem2, targScheduler
Execution
flags and target to Write back
BranchEval
MacroMacro--FusionFusion(cont)(cont)
Lower latencyIncreased bandwidth“virtually” increase storage
Macro-fusion makes the machine behave as if it is wider and deeper, withoutthe additional cost
Enabling Greater Performance & Enabling Greater Performance & EfficiencyEfficiency
36
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
Wide Dynamic Wide Dynamic ExecutionExecution
4 wide rename4 wide micro-op execution 4 wide retire
Deeper out of order storage
32 discontiguous micro-opsconsidered for dispatch per cycle
33% Wider Than Previous 33% Wider Than Previous GenerationGeneration
37
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
Advanced Digital Advanced Digital Media BoostMedia Boost
128-bit packed Multiplyplus
128-bit packed Addplus
128-bit packed Loadplus
128-bit packed Storeplus
(how about a CMPJCC)
2x Compute Throughput / 2x Compute Throughput / ClockClock
38
Advanced Digital Media BoostAdvanced Digital Media Boost
Lets scale a vector: Lets scale a vector: B[iB[i] := ] := A[iA[i] * C] * C
A
B
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
2x Compute Throughput / 2x Compute Throughput / ClockClock
39
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
Assume both Microarchitectures have 128Assume both Microarchitectures have 128--bit bit path from L1 to Processorpath from L1 to Processor
Advanced Digital Media BoostAdvanced Digital Media Boost
A
B
2x Compute Throughput / 2x Compute Throughput / ClockClock
40
Advanced Digital Media BoostAdvanced Digital Media Boost
...handles all the memory data...handles all the memory data
A
B
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
Multiply can’tkeep up with load bandwidth
multiplieroperateson all data
2x Compute Throughput / 2x Compute Throughput / ClockClock
41
Advanced Digital Media BoostAdvanced Digital Media BoostExisting implementations eventually stall the load Existing implementations eventually stall the load
pipe waiting for multiplierpipe waiting for multiplier
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
42
Advanced Digital Media BoostAdvanced Digital Media Boost
...keeps pipeline free for computations...keeps pipeline free for computations
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
43
Advanced Digital Media BoostAdvanced Digital Media Boost...maintains 2X throughput compared to prior ...maintains 2X throughput compared to prior
implementationsimplementations
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
44
Advanced Digital Media BoostAdvanced Digital Media Boost
8 Single Precision Flops/cycle8 Single Precision Flops/cycle
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
45
4 Double Precision Flops/cycle4 Double Precision Flops/cycleAdvanced Digital Media BoostAdvanced Digital Media Boost
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
46
Advanced Digital Media BoostAdvanced Digital Media Boost
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
47
Advanced Digital Media BoostAdvanced Digital Media Boost
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
48
Advanced Digital Media BoostAdvanced Digital Media Boost
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
49
Advanced Digital Media BoostAdvanced Digital Media Boost
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
50
Advanced Digital Media BoostAdvanced Digital Media Boost
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
2x Compute Throughput / 2x Compute Throughput / ClockClock
51
Advanced Digital Media BoostAdvanced Digital Media Boost
ExistingProcessor
IntelIntel®® CoreCore™™ uuarcharchAdvanced Digital Media BoostAdvanced Digital Media Boost
A
B
Load eventuallystalls waiting formultiplier
Load pipeis free to advance
Leading Compute DensityLeading Compute Density2x Compute Throughput / 2x Compute Throughput /
ClockClock
52
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
Memory Memory DisambiguationDisambiguation
Improved Improved PrefetchersPrefetchers
Smart MemorySmart MemoryAccessAccess
Hiding Latency to Memory Hiding Latency to Memory SubsystemSubsystem
53
Smart Memory Access Smart Memory Access –– GoalGoal
L1Data
Cache
CORE1
L1Data
Cache
CORE2
Smart-SharedL2 Cache
System Bus
WHENWHEN Ensure data can be used as Ensure data can be used as earlyearly as possibleas possible
WHEREWHERE Ensure user of data has it as Ensure user of data has it as closeclose as possibleas possible
Hiding Latency to Memory Hiding Latency to Memory SubsystemSubsystem
54
Subsequent Loads Must WaitSubsequent Loads Must Wait
Memory
Store1 Load2Store3Load4
Data Y
Data Z
Data W
Data X
Without Memory DisambiguationWithout Memory Disambiguation
Load4 must WAIT until previous stores complete
Waits for Data X before can executeY
W
Y
X
12
4
3
55
Solving the Problem of Solving the Problem of WhenWhen
Memory
Store1 Load2Store3Load4
Data Y
Data Z
Data W
Data X
With IntelWith Intel’’s New Memory Disambiguations New Memory Disambiguation
Loads can decouple from Stores
Load4 can get its data FIRSTY
W
Y
X
23
4
1
Smart Memory AccessSmart Memory Access
56
Memory Disambiguation predictorMemory Disambiguation predictor––Loads that are predicted NOT to forward from Loads that are predicted NOT to forward from
preceding store are allowed to schedule as early preceding store are allowed to schedule as early as possibleas possible–– increasing the performance of OOO memory pipelinesincreasing the performance of OOO memory pipelines
Disambiguated loads checked at retirementDisambiguated loads checked at retirement––Extension to existing coherency mechanismExtension to existing coherency mechanism
––Invisible to software and systemInvisible to software and system
Memory DisambiguationMemory DisambiguationSmart Memory AccessSmart Memory Access
Hiding Latency to Memory Hiding Latency to Memory SubsystemSubsystem
57
Smart Memory AccessSmart Memory AccessPrefetchersPrefetchers
SharedL2
DataCache
oldest
youngest L1Data
Cache
Load1 Load2Load3Load4
58
Smart Memory Access: PrefetchersSmart Memory Access: Prefetchers
oldest
youngest
Memory is too far awayMemory is too far away
L1Data
Cache
Load1 Load2Load3Load4
SharedL2
DataCache
59
Smart Memory Access: PrefetchersSmart Memory Access: Prefetchers
oldest
youngest
Caches are closerCaches are closerwhen they have the datawhen they have the data
L1Data
Cache
Load1 Load2Load3Load4
SharedL2
DataCache
60
Smart Memory Access: PrefetchersSmart Memory Access: Prefetchers
oldest
youngest
Prefetchers detectPrefetchers detectapplications dataapplications datareference patternsreference patterns
L1Data
Cache
Load1 Load2Load3Load4
SharedL2
DataCache
61
Smart Memory Access: PrefetchersSmart Memory Access: Prefetchers
oldest
youngest
And bring the data And bring the data closer to data consumercloser to data consumer
L1Data
Cache
Load1 Load2Load3Load4
SharedL2
DataCache
62
Smart Memory Access: PrefetchersSmart Memory Access: Prefetchers
SharedL2
DataCache
oldest
youngest L1Data
Cache
Load1 Load2Load3Load4
Solving the Problem of Solving the Problem of Where Where Minimizing Memory LatencyMinimizing Memory Latency
63
Prefetchers and MultiPrefetchers and Multi--CoreCore
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
64
Prefetchers and MultiPrefetchers and Multi--CoreCore
65
Prefetchers and MultiPrefetchers and Multi--CoreCore
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
Three Individual Prefetchers per CoreThree Individual Prefetchers per CoreTwo L2 Prefetchers dynamically Two L2 Prefetchers dynamically
sharedshared
66
Smart Memory AccessSmart Memory Access8 Prefetchers per two8 Prefetchers per two--core processorcore processor–– 2 data and 1 instruction 2 data and 1 instruction prefetcherprefetcher per coreper core
–– able to handle multiple simultaneous patternsable to handle multiple simultaneous patterns–– 2 prefetchers in the L2 cache2 prefetchers in the L2 cache
–– tracking multiple patterns per coretracking multiple patterns per core
Prefetchers monitor demand traffic and regulate Prefetchers monitor demand traffic and regulate ““aggressionaggression””
Implementation Implementation ““knobsknobs”” allow platform and allow platform and segment specific settings tailored to applications segment specific settings tailored to applications and usage modelsand usage models
Data Is Data Is WhereWhere You Need It, You Need It, WhenWhen You Need ItYou Need It
67
Advanced Smart CacheAdvanced Smart CacheMultiMulti--core Optimized core Optimized
All the Smart Cache benefits:• L2 can adapt to each core’s load• Fast data sharing• No replicated data
Plus:• 2X BW to L1 caches
CoreCore22
CoreCore11
L2 CacheL2 Cache
Shared & MultiShared & Multi--Core Optimized, Core Optimized, with 2x Bandwidthwith 2x Bandwidth
68
Advanced Smart Cache Advanced Smart Cache Dynamic Cache AllocationDynamic Cache Allocation
Advanced Advanced Smart CacheSmart Cache
Independent Independent Cache Cache (today)(today)
Core1Core1 Core2Core2Core1Core1 Core2Core2
L2 CacheL2 Cache
Shared Cache adapts to mismatchedloads. Independent Cache can thrashheavy app even when other cache isunder-utilized
L2 L2 CacheCache
L2 L2 CacheCache
69
Advanced Smart Cache Advanced Smart Cache Efficient Data SharingEfficient Data Sharing
Independent CacheIndependent CacheAdvanced Smart Advanced Smart CacheCache
Core2Core2Core1Core1
L2 CacheL2 Cache
Core2Core2Core1Core1
L2 CacheL2 Cache
Main memory
L2 CacheL2 Cache
Main memoryFSBFSB FSBFSB
2X L2 to L1 Bandwidth2X L2 to L1 Bandwidth
70
Intelligent Power CapabilityIntelligent Power Capability
Extending the power management architecture• Intel® Pentium® M processor innovated a new power
management architecture• Intel® Core™ Duo extended the Pentium® M processor
capability to multi-core
New Power Features within each processor core• Ultra fine-grained power control• Split Busses• Platformization of Power Management Architecture
Enhancing Energy EfficiencyEnhancing Energy Efficiency
71
Ultra Fine Grained Power ControlUltra Fine Grained Power Control
Even during periods ofhigh performanceexecution, many partsof the chip core can beshut off.
Example could be aSW memory initializationexecuting from frontend with IQ operatingas loop cache.
ALUBranch
MMX/SSEFPmove
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
FP FP FP
Intelligent Power CapabilityIntelligent Power Capability
72
Intelligent Power CapabilityIntelligent Power CapabilitySplit Busses (core power feature)Split Busses (core power feature)
Many buses are sized for worst-case data
(x86 instruction of 15 bytes)(ALU can write-back 128 bits)
Improved Energy EfficiencyImproved Energy Efficiency
73
Intelligent Power CapabilityIntelligent Power CapabilitySplit Busses (core power feature)Split Busses (core power feature)
By splitting buses to dealwith varying data widths,
we can gain the performancebenefit of bus width while
maintaining C dynamiccloser to thinner buses
Improved Energy EfficiencyImproved Energy Efficiency
74
Platformization of Power Platformization of Power Management ArchitectureManagement Architecture
Integrating best features from Server Integrating best features from Server and Mobile productsand Mobile products
Exposing more to the systemExposing more to the system
PSIPSI--22 Power Status Indicator (Mobile)Power Status Indicator (Mobile)
DTSDTS Digital Thermal SensorsDigital Thermal Sensors
PECIPECI Platform Environment Control Platform Environment Control InterfaceInterface
75
Power Status Indicator Power Status Indicator (Mobile)(Mobile)
Processor communicates power consumption Processor communicates power consumption to external platform componentsto external platform components––Optimization of voltage regulator efficiencyOptimization of voltage regulator efficiency
––Load line and power delivery efficiency Load line and power delivery efficiency
PSI-2 / VID
VR
76
Enabling Efficient Processor and Enabling Efficient Processor and Platform Thermal ControlPlatform Thermal Control……
DTS DTS –– Digital Thermal SensorDigital Thermal Sensor
Several thermal sensors are located within the Several thermal sensors are located within the Processor to cover all possible hot spotsProcessor to cover all possible hot spots
Dedicated logic scans the thermal sensors and Dedicated logic scans the thermal sensors and measures the maximum temperature on the die at measures the maximum temperature on the die at any given timeany given time
Accurately reporting Processor temperature enables Accurately reporting Processor temperature enables advanced thermal control schemesadvanced thermal control schemes
LPF
LPF Core 1 DTS Logic
Core 2DTS Logic
DTS control
and status
77
Platform Environment Platform Environment Control Interface (PECI)Control Interface (PECI)
Processor provides its temperature reading over Processor provides its temperature reading over a a multi drop single wire busmulti drop single wire bus allowing efficient allowing efficient platform thermal controlplatform thermal control
ProcessorFan
AuxiliaryFan
Manager
PECI
ChassisFan 1
ChassisFan 2
PROC #2
PROC #3
PROC #1
78
2M/4Mshared L2
Cache
up to10.4 Gb/s
FSB
L1 D-Cache and D-TLB
LoadLoad
SchedulersSchedulers
Retirement UnitRetirement Unit((ReOrderReOrder Buffer)Buffer)
ALUBranch
MMX/SSEFPmove
DecodeDecode
Rename/AllocRename/Alloc
uCodeuCodeROMROM
Instruction Fetch Instruction Fetch and and PreDecodePreDecode
ALUFAdd
MMX/SSEFPmove
ALUALUFMulFMul
MMX/SSEMMX/SSEFPmoveFPmove
Instruction QueueInstruction Queue
StoreStore
4444
4444
5555
Microarchitecture Microarchitecture Feature SummaryFeature Summary
New, StateNew, State--ofof--thethe--Art, Art, MicroarchitectureMicroarchitecture
Wide Dynamic Execution33% wider pipes (4 vs. 3) and greater efficiency
Advanced Digital Media Boost2x compute throughput / clock
Smart Memory AccessMinimizing latency – Data Where & When needed
Advanced Smart CacheMulti-Core optimized, shared with 2x bandwidth
Intelligent Power CapabilityImproved energy efficientperformance
79
AgendaAgenda
–– MultiMulti--core Update and New Microcore Update and New Micro--architecture level set architecture level set
–– New IntelNew Intel®® CoreCore™™ MicroarchitectureMicroarchitecture–– Intel Microarchitecture HistoryIntel Microarchitecture History
–– IntelIntel®® CoreCore™™ Microarchitecture Design Goals and RoadmapMicroarchitecture Design Goals and Roadmap
–– Processor Architecture 101 Processor Architecture 101
–– IntelIntel®® CoreCore™™ MicroarchitectureMicroarchitecture
–– Software ImplicationsSoftware Implications
–– SummarySummary
80
IntelIntel®® CoreCore™™ Microarchitecture Microarchitecture and Softwareand Software
Software consistency across application spaceSoftware consistency across application space–– Wide Dynamic Execution will provide generic performance gains Wide Dynamic Execution will provide generic performance gains –– Smart Memory Access targets memory intensive appsSmart Memory Access targets memory intensive apps–– Advanced Digital Media Boost provides a leap in capability for Advanced Digital Media Boost provides a leap in capability for
media and floating point appsmedia and floating point apps–– MultiMulti--Core and Advanced Smart Cache further improve the Core and Advanced Smart Cache further improve the
growing number of multigrowing number of multi--threaded applicationsthreaded applications
Software consistency across markets segmentsSoftware consistency across markets segments––New apps and optimizations can target single New apps and optimizations can target single
microarchitecturemicroarchitecture
Immediate Performance Immediate Performance Increase Across Applications Increase Across Applications
and Segmentsand Segments
81
Agenda
• Multi-core Update and New Micro-architecture level set
• New Intel® Core™ Microarchitecture
• Summary
82
Summary
Continuing to drive aggressive multi-core ramp– Dual-core ramp in 2006, quad-core starts in early 2007
Intel® Core™ microarchitecture delivers leading performance and performance/watt
– Conroe – >40% performance increase1 / >40% less power– Woodcrest - >80% performance increase1 / >35% less power– Mobile - Extending leadership delivered with Intel® Core™ Duo
with >20% performance increase1
On track for product introductions starting in Q3’06– Based upon new Intel® Core™ microarchitecture
1 - Estimated SPECint* rate