Intel® Core™ Microarchitecture
March 8, 2006
Stephen L. Smith, Vice President, Digital Enterprise Group
Bob Valentine, Architect, Intel Architecture Group

N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth


Page 1:

NEW

Intel® Core™ Microarchitecture

March 8, 2006

Stephen L. Smith
Vice President, Digital Enterprise Group

Bob Valentine
Architect, Intel Architecture Group

Page 2:

Agenda

• Multi-core Update and New Microarchitecture Level Set

• New Intel® Core™ Microarchitecture

• Wrap Up

Page 3:

Intel Multi-core Roadmap – Updates since Fall IDF

Page 4:

Ramping Multi-core Everywhere

All products and dates are preliminary and subject to change without notice.

Page 5:

Refresher: What is Multi-Core?

Two or more independent execution cores in the same processor

Specific implementations will vary over time, driven by product implementation and manufacturing efficiencies
• Best mix of product architecture and volume manufacturing capabilities
– Architecture: shared caches vs. independent caches
– Manufacturing capabilities: volume packaging technology
• Designed to deliver performance, OEM and end user experience

Examples (each diagram shows Core0 and Core1 attached to the Front Side Bus):
• Single die (monolithic) based processor. Example: 90nm Pentium® D Processor (Smithfield)
• Single die (monolithic) based processor. Example: Intel® Core™ Duo Processor (Yonah)
• Multi-Chip Processor. Example: 65nm Pentium® D Processor (Presler)

*Not representative of actual die photos or relative size

Page 6:

Intel® Core™ Micro-architecture

*Not representative of actual die photo or relative size

Page 7:

Intel Multi-core Roadmap

Page 8:

Intel Multi-core Roadmap

Page 9:

Intel® Core™ Microarchitecture Based Platforms

Segment: platform; processors (2006 to 2007)
• MP Servers: Caneland Platform (2007); Tigerton (QC) (2007)
• DP Servers/DP Workstation: Bensley Platform (Q2’06) / Glidewell Platform (Q2’06); Woodcrest (Q3’06), Clovertown (QC) (Q1’07)
• UP Servers/UP Workstation: Kaylo Platform (Q3’06) / Wyloway Platform (Q3’06); Conroe (Q3’06), Kentsfield (QC) (Q1’07)
• Mobile Client: Napa Platform (Q1’06); Merom (2H’06)
• Desktop-Home: Bridge Creek Platform (Mid’06); Conroe (Q3’06), Kentsfield (QC) (Q1’07)
• Desktop-Office: Averill Platform (Mid’06); Conroe (Q3’06)

All products and dates are preliminary and subject to change without notice.

Note: only Intel® Core™ microarchitecture based processors listed

QC refers to Quad-Core

Page 10:

Intel® Core™ Microarchitecture Performance

Delivering both industry-leading performance and performance/watt

• Conroe: >40% improvement in performance(1) and >40% reduction in power(2)
– As compared to today’s high-end Pentium® D processor 950 (formerly Presler)

• Woodcrest: >80% improvement in performance(1) and >35% reduction in power(2)
– As compared to today’s high-end Dual-Core Intel® Xeon® processor 2.8GHz (formerly Paxville DP)

• Merom: Extends the already significant performance and performance/watt leadership delivered with today's Intel® Core™ Duo processor with greater than 20% additional performance(1) improvement
– As compared to today’s high-end Intel® Core™ Duo processor (formerly Yonah)

(1) Estimated SPECint*_rate_base2000

(2) Expected reduction in TDP
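The performance and power figures above can be combined into a rough performance/watt ratio. The sketch below is only illustrative: it treats the quoted ">" figures as exact lower bounds and treats the TDP reduction as a stand-in for power, both of which are assumptions, and the function name is hypothetical.

```python
def perf_per_watt_gain(perf_improvement: float, power_reduction: float) -> float:
    """Relative performance/watt versus the baseline processor.

    perf_improvement: fractional gain, e.g. 0.40 for "+40%".
    power_reduction:  fractional cut,  e.g. 0.40 for "-40%".
    """
    if not 0.0 <= power_reduction < 1.0:
        raise ValueError("power_reduction must be in [0, 1)")
    return (1.0 + perf_improvement) / (1.0 - power_reduction)


# Lower-bound figures quoted on the slide (assumed exact for this sketch).
conroe = perf_per_watt_gain(0.40, 0.40)     # 1.40 / 0.60
woodcrest = perf_per_watt_gain(0.80, 0.35)  # 1.80 / 0.65

print(f"Conroe:    about {conroe:.2f}x performance/watt")
print(f"Woodcrest: about {woodcrest:.2f}x performance/watt")
```

Under these assumptions the ratios come out at roughly 2.3x for Conroe and 2.8x for Woodcrest relative to their respective baselines.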

Page 11:

Agenda

• Multi-core Update and New Micro-architecture level set

• New Intel® Core™ Microarchitecture

• Summary

Page 12:

Inside the Intel® Core™ Microarchitecture

Page 13:

Agenda

– Multi-core Update and New Micro-architecture level set

– New Intel® Core™ Microarchitecture
  – Intel Microarchitecture History
  – Intel® Core™ Microarchitecture Design Goals and Roadmap
  – Processor Architecture 101
  – Intel® Core™ Microarchitecture
  – Software Implications

– Wrap Up

Page 14:

Microarchitecture History

Page 15:

New Microarchitecture Coming in 2006


Page 17:

Intel® Core™ Microarchitecture: Design Goals

Deliver world class performance combined with superior energy/power efficiency
– Existing and emerging applications and uses
– Greater performance and performance/watt
– Optimized for Intel multi-core platforms

Deliver single foundation for optimized processors across each segment and power envelope
– Optimized for mobile, desktop and server segments

Driving Performance and Performance/Watt Leadership


Page 19:

Processor Architecture 101

Delivered Performance = Frequency * Instructions Per Cycle (IPC)

Goal is higher performance and lower power

Power ∝ Cdynamic * V * V * Frequency

Cdynamic is roughly a product of area and activity: “how many bits” * “how much do they toggle”
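The two relations on this slide can be written as small functions to compare design points. This is a minimal sketch: the function names are hypothetical, the power value is only proportional (no units), and both design points are invented numbers chosen for illustration, not figures from the presentation.

```python
def delivered_performance(frequency_ghz: float, ipc: float) -> float:
    """Delivered Performance = Frequency * IPC (billions of instructions per second)."""
    return frequency_ghz * ipc


def relative_power(c_dynamic: float, voltage: float, frequency_ghz: float) -> float:
    """Power is proportional to Cdynamic * V * V * Frequency (arbitrary units)."""
    return c_dynamic * voltage * voltage * frequency_ghz


# Two hypothetical design points (illustrative values only).
fast_narrow = {"frequency_ghz": 3.6, "ipc": 1.0, "c_dynamic": 1.0, "voltage": 1.3}
slow_wide = {"frequency_ghz": 2.4, "ipc": 1.5, "c_dynamic": 1.3, "voltage": 1.1}

for name, d in (("fast/narrow", fast_narrow), ("slow/wide", slow_wide)):
    perf = delivered_performance(d["frequency_ghz"], d["ipc"])
    power = relative_power(d["c_dynamic"], d["voltage"], d["frequency_ghz"])
    print(f"{name}: performance {perf:.2f}, relative power {power:.2f}")
```

With these made-up inputs both points deliver about the same performance, but the lower-frequency, higher-IPC point does so at lower relative power even though its Cdynamic is larger.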

Page 20:

Processor Architecture 101

Delivered Performance = Frequency * Instructions Per Cycle (IPC)

Frequency is proportional to voltage, so frequency reduction coupled with voltage reduction results in cubic reduction in power.

Power ∝ Cdynamic * V * V * Frequency
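The cubic relation follows from substituting the voltage/frequency proportionality into the power relation. The derivation below is a sketch under the slide's own simplification that voltage scales linearly with frequency; the scaling factor $s$ and the 0.8 example are illustrative assumptions.

```latex
\begin{align*}
P &\propto C_{\text{dynamic}} \, V^{2} f, \qquad V \propto f
  \;\Longrightarrow\; P \propto C_{\text{dynamic}} \, f^{3} \\[4pt]
f' &= s f \;\Longrightarrow\; \frac{P'}{P} = s^{3} \\[4pt]
s &= 0.8 \;\Longrightarrow\; \frac{P'}{P} = 0.8^{3} = 0.512
\end{align*}
```

In this example a 20% frequency reduction (with the matching voltage reduction) cuts power to about half, while delivered performance at fixed IPC falls only to 0.8 of its previous value.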

Page 21:

Processor Architecture 101

Delivered Performance = Frequency * Instructions Per Cycle (IPC)

Higher IPC usually results in wider data paths and/or more speculation, directly increasing Cdynamic

Power ∝ Cdynamic * V * V * Frequency


Page 23:

Intel® Core™ Microarchitecture

Block Diagram Walkthrough

[Block diagram. Front end: Instruction Fetch and PreDecode; Instruction Queue; Decode, with uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer). Execution: Schedulers feeding three ports (ALU, Branch, MMX/SSE, FPmove; ALU, FAdd, MMX/SSE, FPmove; ALU, FMul, MMX/SSE, FPmove) plus Load and Store. Memory: L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB, up to 10.4 Gb/s. Width labels on the diagram: 4, 4, 5.]

Page 24:

Intel® Core™ Microarchitecture

[Block diagram, as on page 23]

In order:
– instruction fetch
– instruction decode
– micro-op rename
– micro-op allocate

Page 25:

Intel® Core™ Microarchitecture

[Block diagram, as on page 23]

Out of order:
– micro-op schedule
– micro-op execute

Page 26:

Intel® Core™ Microarchitecture

[Block diagram, as on page 23]

Out of order:
– memory pipelines

Memory order unit maintains architectural ordering requirements

Page 27:

Intel® Core™ Microarchitecture

[Block diagram, as on page 23]

In order:
– micro-op retirement
– fault handling

Retirement Unit maintains illusion of in order instruction retirement
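The idea of executing out of order while retiring in order can be sketched with a small reorder-buffer model. This is a toy illustration, not the actual hardware design: the class, its methods, and the three micro-op names are hypothetical, and fault handling is not modelled.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: uops complete in any order but retire in program order."""

    def __init__(self):
        self._entries = deque()  # program order, oldest entry on the left

    def allocate(self, name):
        """Allocate an entry in program order (done by the in-order front end)."""
        entry = {"name": name, "done": False}
        self._entries.append(entry)
        return entry

    def complete(self, entry):
        """Mark an entry as executed; this may happen out of order."""
        entry["done"] = True

    def retire(self):
        """Retire only from the oldest end, stopping at the first unfinished uop."""
        retired = []
        while self._entries and self._entries[0]["done"]:
            retired.append(self._entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
load, add, store = (rob.allocate(n) for n in ("load", "add", "store"))

rob.complete(add)        # a younger uop finishes first
first = rob.retire()     # nothing retires: the older load is still pending
rob.complete(load)
second = rob.retire()    # load and add now retire together, in program order
rob.complete(store)
third = rob.retire()

print(first, second, third)
```

The younger `add` finishes early but stays in the buffer until the older `load` completes, which is what lets software observe results as if instructions had run in order.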

Page 28:

Intel® Core™ Microarchitecture

[Block diagram, as on page 23]

• Wide Dynamic Execution
• Advanced Digital Media Boost
• Smart Memory Access
• Advanced Smart Cache
• Intelligent Power Capability

New, State-of-the-Art Microarchitecture

Page 29:

Wide Dynamic Execution

[Block diagram, as on page 23]

Start with Instruction Fetch
– four(+) instructions / cycle
– >33% increase over other x86 processors

Instructions converted to micro-ops (uops)
– ~1 uop per x86 instruction

Page 30:

Micro-op Reduction (recall Processor 101)

Fewer uops per instruction allows IPC to be increased while lowering Cdynamic (fewer bits and less toggling)

Delivered Performance = Frequency * Instructions Per Cycle (IPC)

Power ∝ Cdynamic * V * V * Frequency

Page 31:

Techniques for Micro-op Reduction

ESP Tracker (Extended Stack Pointer)
– Execute Stack Pointer updates in dedicated hardware
– Intel® Core™ microarchitecture increases BW by 33%*

Micro-Op Micro-Fusion
– Single uop representation of “multi-uop” instruction
– Intel® Core™ microarchitecture increases # instructions*

Macro-Fusion
– New technique in Intel® Core™ microarchitecture (more on next pages)

* Techniques pioneered on Intel® Pentium® M processors

Page 32:

New: Macro-Fusion

Represent common x86 instruction pairs in single micro-op
– CMP or TEST + Conditional Branch (Jcc)

Enhanced Arithmetic Logic Unit (ALU) for macro-fusion
– Single dispatch: efficiency
– Single cycle execution: performance

Page 33:

Without Macro-Fusion

Read four instructions from Instruction Queue

Each instruction gets decoded into separate uops

Instruction Queue:
  load eax, [mem1]
  cmp eax, [mem2]
  jne targ
  store [mem3], ebx
  inc ecx

Decoders dec0 to dec3 take four of the instructions in Cycle 1; the remaining instruction is decoded in Cycle 2.

Page 34:

With Intel’s New Macro-Fusion

Read five instructions from Instruction Queue

Send fusable pair to single decoder

Single uop represents two instructions

Instruction Queue:
  load eax, [mem1]
  cmp eax, [mem2]
  jne targ
  store [mem3], ebx
  inc ecx

Decoders dec0 to dec3 produce four uops in Cycle 1:
  load eax, [mem1]
  cmpjne eax, [mem2], targ
  store [mem3], ebx
  inc ecx
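The with/without comparison can be sketched as a pass over the instruction queue that merges each CMP or TEST with an immediately following conditional branch, then counts decode cycles. This is a simplified model under stated assumptions: four decoders that each accept one queue entry per cycle, a partial list of Jcc mnemonics, and no modelling of any other restrictions real hardware may place on fusion. The function names are hypothetical, and `load`/`store` are the slide's pseudo-mnemonics.

```python
import math

# Partial list of conditional branches (Jcc), enough for this illustration.
CONDITIONAL_BRANCHES = {"je", "jne", "jl", "jle", "jg", "jge", "jb", "jbe", "ja", "jae"}
FUSABLE_FIRST = {"cmp", "test"}


def macro_fuse(instructions):
    """Merge each CMP/TEST that is immediately followed by a Jcc into one entry."""
    fused = []
    i = 0
    while i < len(instructions):
        op = instructions[i].split()[0]
        next_op = instructions[i + 1].split()[0] if i + 1 < len(instructions) else None
        if op in FUSABLE_FIRST and next_op in CONDITIONAL_BRANCHES:
            fused.append(f"{instructions[i]} + {instructions[i + 1]}")
            i += 2  # the pair occupies a single decoder slot
        else:
            fused.append(instructions[i])
            i += 1
    return fused


def decode_cycles(entries, decoders=4):
    """Cycles needed if each decoder accepts one entry per cycle (assumption)."""
    return math.ceil(len(entries) / decoders)


queue = [
    "load eax, [mem1]",
    "cmp eax, [mem2]",
    "jne targ",
    "store [mem3], ebx",
    "inc ecx",
]

fused_queue = macro_fuse(queue)
print("without fusion:", decode_cycles(queue), "cycles for", len(queue), "entries")
print("with fusion:   ", decode_cycles(fused_queue), "cycle for", len(fused_queue), "entries")
```

In this model the five-instruction example needs two decode cycles without fusion and one with it, which matches the Cycle 1 / Cycle 2 labels on the two slides.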

Page 35:

Macro-Fusion (cont)

[Diagram: the fused uop "cmpjne eax, mem2, targ" moves from the Scheduler to Execution (Branch Eval), with flags and target sent to Write back]

• Lower latency
• Increased bandwidth
• “Virtually” increase storage

Macro-fusion makes the machine behave as if it is wider and deeper, without the additional cost

Enabling Greater Performance & Efficiency

Page 36:

Wide Dynamic Execution

[Block diagram, as on page 23]

• 4 wide rename
• 4 wide micro-op execution
• 4 wide retire
• Deeper out of order storage
• 32 discontiguous micro-ops considered for dispatch per cycle

33% Wider Than Previous Generation

Page 37:

Advanced Digital Media Boost

[Block diagram, as on page 23]

128-bit packed Multiply
plus 128-bit packed Add
plus 128-bit packed Load
plus 128-bit packed Store
plus (how about a CMPJCC)

2x Compute Throughput / Clock

Page 38:

Advanced Digital Media Boost

Let's scale a vector: B[i] := A[i] * C

[Animation: arrays A and B processed by an existing processor and by the Intel® Core™ microarchitecture with Advanced Digital Media Boost]

2x Compute Throughput / Clock
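The vector-scale example can be sketched in a few lines, along with a rough count of multiplier cycles for two multiplier widths. The 64-bit width used for the comparison case is an assumption chosen to be consistent with the deck's "2x" claim, not a figure stated on the slide; the function names and the sample data are hypothetical, and elements are assumed to be 32-bit single precision.

```python
import math


def scale_vector(a, c):
    """B[i] := A[i] * C for every element."""
    return [x * c for x in a]


def multiply_cycles(n_elements, multiplier_bits, element_bits=32):
    """Cycles needed if the multiplier consumes multiplier_bits of packed data per cycle."""
    lanes_per_cycle = multiplier_bits // element_bits
    return math.ceil(n_elements / lanes_per_cycle)


a = [float(i) for i in range(1, 9)]  # eight single-precision elements
b = scale_vector(a, 0.5)

half_width = multiply_cycles(len(a), 64)   # assumed narrower multiplier
full_width = multiply_cycles(len(a), 128)  # full 128-bit packed multiply

print(b)
print("64-bit multiplier cycles: ", half_width)
print("128-bit multiplier cycles:", full_width)
```

For eight elements the model gives four multiplier cycles at 64 bits per cycle and two at 128 bits per cycle, a 2x difference in multiply throughput per clock.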

Page 39:

Advanced Digital Media Boost

Assume both microarchitectures have 128-bit path from L1 to processor

[Animation frame: arrays A and B; existing processor vs. Intel® Core™ microarchitecture]

2x Compute Throughput / Clock

Page 40:

Advanced Digital Media Boost

...handles all the memory data

[Animation frame: arrays A and B]
– Existing processor: multiply can’t keep up with load bandwidth
– Intel® Core™ microarchitecture: multiplier operates on all data

2x Compute Throughput / Clock

Page 41:

Advanced Digital Media Boost

Existing implementations eventually stall the load pipe waiting for multiplier

[Animation frame: arrays A and B]
– Existing processor: load eventually stalls waiting for multiplier
– Intel® Core™ microarchitecture: load pipe is free to advance

2x Compute Throughput / Clock

Page 42:

Advanced Digital Media Boost

...keeps pipeline free for computations

[Animation frame, labels as on page 41]

2x Compute Throughput / Clock

Page 43:

Advanced Digital Media Boost

...maintains 2X throughput compared to prior implementations

[Animation frame, labels as on page 41]

2x Compute Throughput / Clock

Page 44:

Advanced Digital Media Boost

8 Single Precision Flops/cycle

[Animation frame, labels as on page 41]

2x Compute Throughput / Clock

Page 45:

Advanced Digital Media Boost

4 Double Precision Flops/cycle

[Animation frame, labels as on page 41]

2x Compute Throughput / Clock
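The two peak figures follow from the 128-bit vector width. The sketch below assumes that one packed multiply and one packed add can both issue every cycle (page 37 lists both as 128-bit packed operations); the function name and the `ops_per_cycle` parameter are hypothetical.

```python
def peak_flops_per_cycle(vector_bits: int, element_bits: int, ops_per_cycle: int = 2) -> int:
    """Peak floating-point operations per cycle for packed arithmetic.

    ops_per_cycle=2 assumes one packed multiply plus one packed add per cycle.
    """
    lanes = vector_bits // element_bits
    return lanes * ops_per_cycle


single = peak_flops_per_cycle(128, 32)  # 4 lanes * 2 operations
double = peak_flops_per_cycle(128, 64)  # 2 lanes * 2 operations

print("single precision:", single, "flops/cycle")
print("double precision:", double, "flops/cycle")
```

Under that assumption the model reproduces the 8 single-precision and 4 double-precision flops/cycle quoted on pages 44 and 45.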


Page 51: N E W Intel® Core™ Microarchitecture · Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) Woodcrest (Q3’06) ... Eval Macro-Fusion (cont) Lower latency Increased bandwidth

51

Advanced Digital Media Boost

Existing Processor

Intel® Core™ uarch: Advanced Digital Media Boost

A

B

Load eventually stalls waiting for multiplier

Load pipe is free to advance

Leading Compute Density

2x Compute Throughput / Clock

52

Smart Memory Access

[Figure: core block diagram with labels: Instruction Fetch and PreDecode; Instruction Queue; Decode; uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; ALU/Branch, ALU/FAdd, and ALU/FMul execution units, each with MMX/SSE and FPmove; Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB, up to 10.4 Gb/s; width labels 4, 4, 5]

Memory Disambiguation

Improved Prefetchers

Hiding Latency to Memory Subsystem

53

Smart Memory Access – Goal

[Figure: CORE1 and CORE2, each with its own L1 Data Cache, connected to the Smart-Shared L2 Cache and the System Bus]

WHEN: Ensure data can be used as early as possible

WHERE: Ensure user of data has it as close as possible

Hiding Latency to Memory Subsystem

54

Subsequent Loads Must Wait

Without Memory Disambiguation

[Figure: instructions Store1, Load2, Store3, Load4 (operand labels Y, W, Y, X) accessing memory locations Data W, Data X, Data Y, Data Z, with execution order labels 1 to 4]

Load4 must WAIT until previous stores complete

Waits for Data X before it can execute

55

Solving the Problem of When

With Intel’s New Memory Disambiguation

[Figure: instructions Store1, Load2, Store3, Load4 (operand labels Y, W, Y, X) accessing memory locations Data W, Data X, Data Y, Data Z, with execution order labels 1 to 4]

Loads can decouple from Stores

Load4 can get its data FIRST

Smart Memory Access

56

Smart Memory Access: Memory Disambiguation

Memory Disambiguation predictor
– Loads that are predicted NOT to forward from preceding store are allowed to schedule as early as possible
– increasing the performance of OOO memory pipelines

Disambiguated loads checked at retirement
– Extension to existing coherency mechanism
– Invisible to software and system

Hiding Latency to Memory Subsystem
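
The slides do not say how the predictor is built. As a rough sketch only, assuming one small saturating counter per load (the class, the function names, and the thresholds below are hypothetical and not Intel's design), the predict-early, check-at-retirement idea might be modeled like this:

```python
from dataclasses import dataclass, field


@dataclass
class DisambiguationPredictor:
    """Toy predictor: one saturating counter per load instruction address."""
    threshold: int = 2    # hypothetical: clean retirements needed before hoisting
    max_count: int = 3    # hypothetical saturation limit
    counters: dict = field(default_factory=dict)

    def may_hoist(self, load_pc: int) -> bool:
        """Predict that this load will NOT forward from an older, unresolved store."""
        return self.counters.get(load_pc, 0) >= self.threshold

    def update(self, load_pc: int, aliased: bool) -> None:
        """Train on the outcome seen when the load retires."""
        if aliased:
            self.counters[load_pc] = 0    # conflict: stop hoisting this load
        else:
            current = self.counters.get(load_pc, 0)
            self.counters[load_pc] = min(self.max_count, current + 1)


def run_load(predictor, load_pc, load_addr, pending_store_addrs):
    """Return (hoisted, replayed) for one dynamic instance of a load.

    pending_store_addrs: addresses of older stores that were still
    unresolved when the load was ready to issue.
    """
    hoisted = predictor.may_hoist(load_pc)
    aliased = load_addr in pending_store_addrs
    replayed = hoisted and aliased    # the retirement check catches a bad guess
    predictor.update(load_pc, aliased)
    return hoisted, replayed
```

In this model a load starts out conservative, is hoisted once it has retired cleanly a few times, and falls back to waiting after a single conflict. Software never sees the difference, only the timing changes.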

57

Smart Memory Access: Prefetchers

[Figure: Load1, Load2, Load3, Load4 ordered oldest to youngest, with the L1 Data Cache and the Shared L2 Data Cache]

58

Smart Memory Access: Prefetchers

Memory is too far away

[Figure: Load1, Load2, Load3, Load4 ordered oldest to youngest, with the L1 Data Cache and the Shared L2 Data Cache]

59

Smart Memory Access: Prefetchers

Caches are closer when they have the data

[Figure: Load1, Load2, Load3, Load4 ordered oldest to youngest, with the L1 Data Cache and the Shared L2 Data Cache]

60

Smart Memory Access: Prefetchers

Prefetchers detect application data reference patterns

[Figure: Load1, Load2, Load3, Load4 ordered oldest to youngest, with the L1 Data Cache and the Shared L2 Data Cache]

61

Smart Memory Access: Prefetchers

And bring the data closer to the data consumer

[Figure: Load1, Load2, Load3, Load4 ordered oldest to youngest, with the L1 Data Cache and the Shared L2 Data Cache]

62

Smart Memory Access: Prefetchers

[Figure: Load1, Load2, Load3, Load4 ordered oldest to youngest, with the L1 Data Cache and the Shared L2 Data Cache]

Solving the Problem of Where

Minimizing Memory Latency

63

Prefetchers and Multi-Core

[Figure: core block diagram with labels: Instruction Fetch and PreDecode; Instruction Queue; Decode; uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; ALU/Branch, ALU/FAdd, and ALU/FMul execution units, each with MMX/SSE and FPmove; Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB, up to 10.4 Gb/s; width labels 4, 4, 5]

64

Prefetchers and Multi-Core

65

Prefetchers and Multi-Core

[Figure: core block diagram with labels: Instruction Fetch and PreDecode; Instruction Queue; Decode; uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; ALU/Branch, ALU/FAdd, and ALU/FMul execution units, each with MMX/SSE and FPmove; Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB, up to 10.4 Gb/s; width labels 4, 4, 5]

Three Individual Prefetchers per Core

Two L2 Prefetchers dynamically shared

66

Smart Memory Access

8 Prefetchers per two-core processor
– 2 data and 1 instruction prefetcher per core
  – able to handle multiple simultaneous patterns
– 2 prefetchers in the L2 cache
  – tracking multiple patterns per core

Prefetchers monitor demand traffic and regulate “aggression”

Implementation “knobs” allow platform and segment specific settings tailored to applications and usage models

Data Is Where You Need It, When You Need It
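
The slides describe what the prefetchers do, not how they detect patterns. A minimal sketch, assuming a simple stride detector keyed by load instruction address (the class, its parameters, and the table layout are illustrative assumptions, not the hardware design), shows the general idea of spotting a reference pattern and fetching ahead of it:

```python
class StridePrefetcher:
    """Toy stride detector keyed by the load instruction's address."""

    def __init__(self, confirmations=2, degree=1):
        self.confirmations = confirmations  # stride repeats needed before prefetching
        self.degree = degree                # how far ahead to request ("aggression")
        self.table = {}                     # load_pc -> (last_addr, stride, confidence)

    def observe(self, load_pc, addr):
        """Record one demand access and return the addresses to prefetch."""
        entry = self.table.get(load_pc)
        if entry is None:
            self.table[load_pc] = (addr, 0, 0)
            return []
        last_addr, stride, confidence = entry
        new_stride = addr - last_addr
        if new_stride != 0 and new_stride == stride:
            confidence += 1     # the same stride was seen again
        else:
            confidence = 0      # pattern broken, start over
        self.table[load_pc] = (addr, new_stride, confidence)
        if confidence >= self.confirmations:
            return [addr + new_stride * i for i in range(1, self.degree + 1)]
        return []
```

Because the table is keyed per load, several independent patterns can be tracked at once. Lowering `degree` when demand traffic is heavy would be one way to model the "aggression" regulation mentioned above.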

67

Advanced Smart Cache: Multi-core Optimized

All the Smart Cache benefits:
• L2 can adapt to each core’s load
• Fast data sharing
• No replicated data

Plus:
• 2X BW to L1 caches

[Figure: Core1 and Core2 sharing one L2 Cache]

Shared & Multi-Core Optimized, with 2x Bandwidth

68

Advanced Smart Cache: Dynamic Cache Allocation

[Figure: Advanced Smart Cache (Core1 and Core2 sharing one L2 Cache) compared with Independent Cache (today) (Core1 and Core2 each with its own L2 Cache)]

The shared cache adapts to mismatched loads. An independent cache can thrash a heavy app even when the other cache is under-utilized.
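
The thrashing effect can be illustrated with a toy simulation. This is only a sketch under strong assumptions: a fully associative LRU cache, made-up sizes, and a fixed looping workload (all names and numbers are hypothetical). It compares one shared cache against two independent halves of the same total size:

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache that counts misses (a simplification)."""

    def __init__(self, lines):
        self.lines = lines
        self.data = OrderedDict()
        self.misses = 0

    def access(self, addr):
        if addr in self.data:
            self.data.move_to_end(addr)          # mark as most recently used
        else:
            self.misses += 1
            self.data[addr] = True
            if len(self.data) > self.lines:
                self.data.popitem(last=False)    # evict the least recently used line


def simulate(shared, total_lines=8, rounds=10):
    """Core 1 loops over 6 lines, core 2 over 1 line; return total misses."""
    heavy = [("c1", i) for i in range(6)]
    light = [("c2", 0)]
    if shared:
        c1 = c2 = LRUCache(total_lines)
    else:
        c1 = LRUCache(total_lines // 2)
        c2 = LRUCache(total_lines // 2)
    for _ in range(rounds):
        for addr in heavy:
            c1.access(addr)
        for addr in light:
            c2.access(addr)
    return c1.misses if shared else c1.misses + c2.misses
```

With these numbers the shared cache holds both working sets and only takes the initial cold misses, while the heavy core's private half is too small for its loop and misses on every access even though the light core's half sits mostly empty.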

69

Advanced Smart Cache: Efficient Data Sharing

[Figure: Independent Cache (Core1 and Core2 each with its own L2 Cache, connected over the FSB to Main memory) compared with Advanced Smart Cache (Core1 and Core2 sharing one L2 Cache, connected over the FSB to Main memory)]

2X L2 to L1 Bandwidth

70

Intelligent Power Capability

Extending the power management architecture
• Intel® Pentium® M processor innovated a new power management architecture
• Intel® Core™ Duo extended the Pentium® M processor capability to multi-core

New Power Features within each processor core
• Ultra fine-grained power control
• Split Busses
• Platformization of Power Management Architecture

Enhancing Energy Efficiency

71

Ultra Fine Grained Power Control

Even during periods of high performance execution, many parts of the chip core can be shut off.

An example could be a SW memory initialization executing from the front end with the IQ operating as a loop cache.

[Figure: core block diagram with labels: Instruction Fetch and PreDecode; Instruction Queue; Decode; uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; ALU/Branch, ALU/FAdd, and ALU/FMul execution units, each with MMX/SSE and FPmove; Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB, up to 10.4 Gb/s; width labels 4, 4, 5; three FP markers]

Intelligent Power Capability

72

Intelligent Power Capability: Split Busses (core power feature)

Many buses are sized for worst-case data

(x86 instruction of 15 bytes)

(ALU can write-back 128 bits)

Improved Energy Efficiency

73

Intelligent Power Capability: Split Busses (core power feature)

By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while maintaining C dynamic closer to that of thinner buses.

Improved Energy Efficiency
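
To make the "C dynamic" argument concrete, the usual CMOS switching-power estimate P = activity × C × V² × f can be combined with a toy model of a segmented bus. Everything specific here is an assumption for illustration: the 64-bit segment size, the payload mix, and the idea that an idle segment contributes no switched capacitance are not taken from the slides.

```python
def dynamic_power(c_farads, volts, hertz, activity=1.0):
    """Classic CMOS switching-power estimate: P = activity * C * V^2 * f."""
    return activity * c_farads * volts ** 2 * hertz


def average_capacitance(width_mix, cap_per_bit, bus_bits, segment_bits=None):
    """Average switched capacitance per transfer.

    width_mix maps a payload width in bits to the fraction of transfers
    that have that width. Without segment_bits the whole bus is driven on
    every transfer; with it, only enough segments to cover the payload are.
    """
    total = 0.0
    for payload_bits, fraction in width_mix.items():
        if segment_bits is None:
            driven = bus_bits
        else:
            segments = -(-payload_bits // segment_bits)   # ceiling division
            driven = min(bus_bits, segments * segment_bits)
        total += fraction * driven * cap_per_bit
    return total


# Hypothetical payload mix: half the transfers need only 32 bits.
mix = {32: 0.5, 64: 0.25, 128: 0.25}
monolithic = average_capacitance(mix, 1.0, 128)               # full 128-bit bus
split = average_capacitance(mix, 1.0, 128, segment_bits=64)   # two 64-bit halves
```

Under this made-up mix the split bus still carries a full 128-bit payload in one transfer when needed, but its average switched capacitance is well below that of the monolithic bus, which is the trade-off the slide points at.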

74

Platformization of Power Management Architecture

Integrating best features from Server and Mobile products

Exposing more to the system

PSI-2: Power Status Indicator (Mobile)

DTS: Digital Thermal Sensors

PECI: Platform Environment Control Interface

75

Power Status Indicator (Mobile)

Processor communicates power consumption to external platform components
– Optimization of voltage regulator efficiency
– Load line and power delivery efficiency

[Figure: PSI-2 / VID signals connecting the processor and the VR]

76

Enabling Efficient Processor and Platform Thermal Control…

DTS – Digital Thermal Sensor

Several thermal sensors are located within the Processor to cover all possible hot spots

Dedicated logic scans the thermal sensors and measures the maximum temperature on the die at any given time

Accurately reporting Processor temperature enables advanced thermal control schemes

[Figure: Core 1 DTS Logic and Core 2 DTS Logic, each with an LPF block, feeding DTS control and status]

77

Platform Environment Control Interface (PECI)

Processor provides its temperature reading over a multi drop single wire bus allowing efficient platform thermal control

[Figure: PROC #1, PROC #2, and PROC #3 on the PECI bus with a Manager that controls the Processor Fan, Auxiliary Fan, Chassis Fan 1, and Chassis Fan 2]
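
As a loose illustration of how the DTS reading and a PECI-style manager fit together, the sketch below reduces each processor's sensors to a single maximum and lets a manager pick a fan setting from the hottest processor. It does not model the real PECI protocol: the function names, the polling stand-in, and the linear fan policy with its 50 °C and 90 °C limits are all hypothetical.

```python
def die_temperature(sensor_readings_c):
    """DTS-style summary: report the hottest sensor on the die."""
    return max(sensor_readings_c)


def poll_bus(processors):
    """Stand-in for a manager reading every processor on a shared bus.

    processors maps a processor name to its list of sensor readings (deg C).
    """
    return {name: die_temperature(readings)
            for name, readings in processors.items()}


def fan_duty(temps_c, low_c=50.0, high_c=90.0):
    """Hypothetical policy: scale fan duty (0 to 100 percent) with the hottest processor."""
    hottest = max(temps_c)
    if hottest <= low_c:
        return 0.0
    if hottest >= high_c:
        return 100.0
    return 100.0 * (hottest - low_c) / (high_c - low_c)
```

A manager loop would call `poll_bus` periodically and pass the resulting temperatures to `fan_duty`. A real platform would presumably apply separate policies to each fan.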

78

Microarchitecture Feature Summary

New, State-of-the-Art Microarchitecture

[Figure: core block diagram with labels: Instruction Fetch and PreDecode; Instruction Queue; Decode; uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; ALU/Branch, ALU/FAdd, and ALU/FMul execution units, each with MMX/SSE and FPmove; Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB, up to 10.4 Gb/s; width labels 4, 4, 5]

Wide Dynamic Execution: 33% wider pipes (4 vs. 3) and greater efficiency

Advanced Digital Media Boost: 2x compute throughput / clock

Smart Memory Access: Minimizing latency – Data Where & When needed

Advanced Smart Cache: Multi-Core optimized, shared with 2x bandwidth

Intelligent Power Capability: Improved energy efficient performance

79

Agenda

– Multi-core Update and New Micro-architecture level set
– New Intel® Core™ Microarchitecture
  – Intel Microarchitecture History
  – Intel® Core™ Microarchitecture Design Goals and Roadmap
  – Processor Architecture 101
  – Intel® Core™ Microarchitecture
  – Software Implications
– Summary

80

Intel® Core™ Microarchitecture and Software

Software consistency across application space
– Wide Dynamic Execution will provide generic performance gains
– Smart Memory Access targets memory intensive apps
– Advanced Digital Media Boost provides a leap in capability for media and floating point apps
– Multi-Core and Advanced Smart Cache further improve the growing number of multi-threaded applications

Software consistency across market segments
– New apps and optimizations can target a single microarchitecture

Immediate Performance Increase Across Applications and Segments

81

Agenda

• Multi-core Update and New Micro-architecture level set

• New Intel® Core™ Microarchitecture

• Summary

82

Summary

Continuing to drive aggressive multi-core ramp
– Dual-core ramp in 2006, quad-core starts in early 2007

Intel® Core™ microarchitecture delivers leading performance and performance/watt
– Conroe: >40% performance increase¹ / >40% less power
– Woodcrest: >80% performance increase¹ / >35% less power
– Mobile: Extending leadership delivered with Intel® Core™ Duo with >20% performance increase¹

On track for product introductions starting in Q3’06
– Based upon new Intel® Core™ microarchitecture

¹ Estimated SPECint* rate