Intel® Core™ Microarchitecture

March 8, 2006

Stephen L. Smith, Vice President
Digital Enterprise Group

Bob Valentine, Architect
Intel Architecture Group

2

Agenda

• Multi-core Update and New Microarchitecture Level Set

• New Intel® Core™ Microarchitecture

• Wrap Up

3

Intel Multi-core Roadmap – Updates since Fall IDF

4

Ramping Multi-core Everywhere

All products and dates are preliminary and subject to change without notice.

5

Refresher: What is Multi-Core?

Two or more independent execution cores in the same processor

Specific implementations will vary over time, driven by product implementation and manufacturing efficiencies
• Best mix of product architecture and volume manufacturing capabilities
– Architecture: Shared Caches vs. Independent Caches
– Manufacturing capabilities: volume packaging technology
• Designed to deliver performance, OEM and end user experience

[Figure: three dual-core examples, each drawn as Core0 and Core1 attached to a Front Side Bus, under the headings "Single die (Monolithic) based processor" and "Multi-Chip Processor". Examples shown: 90nm Pentium® D Processor (Smithfield), Intel Core™ Duo Processor (Yonah), and 65nm Pentium D Processor (Presler).]


*Not representative of actual die photos or relative size

6

Intel® Core™ Micro-architecture

*Not representative of actual die photo or relative size

7

Intel Multi-core Roadmap

8

Intel Multi-core Roadmap

9

Intel® Core™ Microarchitecture Based Platforms (2006 and 2007)

• MP Servers: Caneland Platform (2007)
– Tigerton (QC) (2007)
• DP Servers / DP Workstation: Bensley Platform (Q2’06) / Glidewell Platform (Q2’06)
– Woodcrest (Q3’06)
– Clovertown (QC) (Q1’07)
• UP Servers / UP Workstation: Kaylo Platform (Q3’06) / Wyloway Platform (Q3’06)
– Conroe (Q3’06)
– Kentsfield (QC) (Q1’07)
• Mobile Client: Napa Platform (Q1’06)
– Merom (2H’06)
• Desktop - Home: Bridge Creek Platform (Mid’06)
– Conroe (Q3’06)
– Kentsfield (QC) (Q1’07)
• Desktop - Office: Averill Platform (Mid’06)
– Conroe (Q3’06)

All products and dates are preliminary and subject to change without notice.

Note: only Intel® Core™ microarchitecture based processors listed

QC refers to Quad-Core

10

Intel® Core™ Microarchitecture Performance

Delivering both industry leading performance and performance/watt
• Conroe: >40% improvement in performance [1] and >40% reduction in power [2]
– As compared to today’s high-end Pentium® D processor 950 (formerly Presler)
• Woodcrest: >80% improvement in performance [1] and >35% reduction in power [2]
– As compared to today’s high-end Dual-Core Intel® Xeon® processor 2.8GHz (formerly Paxville DP)
• Merom: Extends the already significant performance and performance/watt leadership delivered with today's Intel® Core™ Duo processor with greater than 20% additional performance [1] improvement
– As compared to today’s high-end Intel® Core™ Duo processor (formerly Yonah)

[1] Estimated SPECint*_rate_base2000
[2] Expected reduction in TDP
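The performance and power figures above can be combined into a rough performance/watt ratio. The sketch below is only an illustration: it treats the ">40%" style lower bounds as exact values and treats the TDP reduction as if it were the reduction in consumed power, neither of which the slide states. The function name `perf_per_watt_gain` is hypothetical.

```python
def perf_per_watt_gain(perf_improvement: float, power_reduction: float) -> float:
    """Return performance/watt relative to the baseline processor.

    perf_improvement: fractional gain, e.g. 0.40 for +40% performance.
    power_reduction:  fractional drop, e.g. 0.40 for 40% less power.
    """
    if not 0.0 <= power_reduction < 1.0:
        raise ValueError("power_reduction must be in [0, 1)")
    return (1.0 + perf_improvement) / (1.0 - power_reduction)


# ASSUMPTION: the slide's lower bounds are used as if they were exact values.
conroe = perf_per_watt_gain(0.40, 0.40)      # vs. Pentium D processor 950
woodcrest = perf_per_watt_gain(0.80, 0.35)   # vs. Dual-Core Xeon 2.8GHz

print(f"Conroe:    {conroe:.2f}x performance/watt")
print(f"Woodcrest: {woodcrest:.2f}x performance/watt")
```

Because both inputs are lower bounds, the ratios printed here are best read as rough floors under these assumptions, not as measured results.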

11

Agenda

• Multi-core Update and New Micro-architecture level set


• Summary

12

Inside the Intel® Core™ Microarchitecture

13

Agenda

– Multi-core Update and New Micro-architecture level set
– New Intel® Core™ Microarchitecture
– Intel Microarchitecture History
– Intel® Core™ Microarchitecture Design Goals and Roadmap
– Processor Architecture 101
– Intel® Core™ Microarchitecture
– Software Implications
– Wrap Up

14

Microarchitecture History

15

New Microarchitecture Coming in 2006

16

Agenda

– Intel® Core™ Microarchitecture Design Goals

17

Intel® Core™ Microarchitecture: Design Goals

Deliver world class performance combined with superior energy/power efficiency
– Existing and emerging applications and uses
– Greater performance and performance/watt
– Optimized for Intel Multi-core platforms

Deliver single foundation for optimized processors across each segment and power envelope
– Optimized for mobile, desktop and server segments

Driving Performance and Performance/Watt Leadership

18


19

Processor Architecture 101

Delivered Performance = Frequency * Instructions Per Cycle (IPC)

Goal is higher performance and lower power

Power ∝ Cdynamic * V * V * Frequency

Cdynamic is roughly a product of area and activity: “how many bits” * “how much do they toggle”

20




Frequency is proportional to voltage, so frequency reduction coupled with voltage reduction results in cubic reduction in power.
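The cubic relationship follows from substituting a voltage that tracks frequency into the power relation from Processor Architecture 101. A minimal sketch, assuming voltage scales exactly linearly with frequency and Cdynamic stays fixed; the function names are hypothetical:

```python
def relative_power(freq_scale: float, volt_scale: float, cdyn_scale: float = 1.0) -> float:
    """Power relative to baseline, using Power ∝ Cdynamic * V * V * Frequency."""
    return cdyn_scale * volt_scale * volt_scale * freq_scale


def relative_power_with_voltage_tracking(freq_scale: float) -> float:
    """ASSUMPTION: voltage is lowered in exact proportion to frequency."""
    return relative_power(freq_scale, volt_scale=freq_scale)


for scale in (1.0, 0.9, 0.8, 0.5):
    power = relative_power_with_voltage_tracking(scale)
    print(f"frequency x{scale:.1f} -> power x{power:.3f}")
```

Under these assumptions, running at 80% of the baseline frequency costs about half the baseline power (0.8 cubed is 0.512), which is why the relation gives up little performance for a large power saving.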


21




Higher IPC usually results in wider data paths and/or more speculation: directly increasing Cdynamic


22


23

Block Diagram Walkthrough

[Block diagram of one core. Blocks shown: Instruction Fetch and PreDecode; Instruction Queue; Decode; uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; three execution unit groups (ALU + Branch + MMX/SSE/FPmove, ALU + FAdd + MMX/SSE/FPmove, ALU + FMul + MMX/SSE/FPmove); Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB, up to 10.4 Gb/s]

24

In order

• instruction fetch
• instruction decode
• micro-op rename
• micro-op allocate

25

Out of order

• micro-op schedule
• micro-op execute

26

Out of order

• memory pipelines
• memory order unit maintains architectural ordering requirements

27

In order

• micro-op retirement
• fault handling

Retirement Unit maintains illusion of in order instruction retirement

28

Intel® Core™ Microarchitecture

• Wide Dynamic Execution
• Advanced Digital Media Boost
• Smart Memory Access
• Advanced Smart Cache
• Intelligent Power Capability

New, State-of-the-Art Microarchitecture

29

Wide Dynamic Execution

Start with Instruction Fetch

four(+) instructions / cycle

>33% increase over other x86 processors

Instructions converted to micro-ops (uops)

~1 uop per x86 instruction

30

Micro-op Reduction (recall Processor 101)

Fewer uops per instruction allows IPC to be increased while lowering Cdynamic (fewer bits and less toggling)

Power ∝ Cdynamic * V * V * Frequency

31

Techniques for Micro-op Reduction

ESP Tracker (Extended Stack Pointer)
– Execute Stack Pointer updates in dedicated hardware
– Intel® Core™ microarchitecture increases BW by 33%*

Micro-Op Micro-Fusion
– Single uop representation of “multi-uop” instruction
– Intel® Core™ microarchitecture increases # of instructions*

Macro-Fusion
– New technique in Intel® Core™ microarchitecture (more on next pages)

* Techniques pioneered on Intel® Pentium® M processors

32

New: Macro-Fusion

Represent common x86 instruction pairs in a single micro-op
– CMP or TEST + Conditional Branch (Jcc)

Enhanced Arithmetic Logic Unit (ALU) for macro-fusion
– Single dispatch: efficiency
– Single cycle execution: performance

33

Without Macro-Fusion

Read four instructions from Instruction Queue
Each instruction gets decoded into separate uops

[Figure: the Instruction Queue feeds decoders dec0 to dec3 over Cycle 1 and Cycle 2. Instructions shown:]

    load eax, [mem1]
    cmp eax, [mem2]
    jne targ
    inc ecx
    store [mem3], ebx

34

With Intel’s New Macro-Fusion

Read five instructions from Instruction Queue
Send fusable pair to single decoder
Single uop represents two instructions

[Figure: the Instruction Queue feeds decoders dec0 to dec3 in Cycle 1 only; cmp and jne go to one decoder and emerge as a single fused uop. Decoded uops shown:]

    load eax, [mem1]
    cmpjne eax, [mem2], targ
    inc ecx
    store [mem3], ebx
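The decode-slot arithmetic behind the two macro-fusion slides can be sketched as a toy model. It assumes the five example instructions appear in the program order listed below, that every instruction or fused pair occupies exactly one of four decode slots, and that no other decoder restrictions apply; real hardware has additional rules that this ignores. All names are hypothetical, and `load`/`store` are the slide's pseudo-mnemonics.

```python
from typing import List

FUSABLE_FIRST = {"cmp", "test"}   # per the slides: CMP or TEST followed by Jcc
DECODERS = 4                      # decode slots per cycle (dec0..dec3)


def is_cond_branch(mnemonic: str) -> bool:
    """Treat any j<cc> mnemonic other than the unconditional jmp as a Jcc."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def fuse(instrs: List[str]) -> List[str]:
    """Merge each adjacent CMP/TEST + Jcc pair into one decode slot."""
    out: List[str] = []
    i = 0
    while i < len(instrs):
        op = instrs[i].split()[0]
        nxt = instrs[i + 1].split()[0] if i + 1 < len(instrs) else ""
        if op in FUSABLE_FIRST and is_cond_branch(nxt):
            out.append(instrs[i] + " + " + instrs[i + 1])
            i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out


def decode_cycles(slots: List[str]) -> int:
    """Cycles needed when each cycle fills up to DECODERS slots."""
    return -(-len(slots) // DECODERS)  # ceiling division


# ASSUMPTION: program order of the slide's five example instructions.
program = [
    "load eax, [mem1]",
    "cmp eax, [mem2]",
    "jne targ",
    "inc ecx",
    "store [mem3], ebx",
]

print("without fusion:", decode_cycles(program), "cycles")
print("with fusion:   ", decode_cycles(fuse(program)), "cycle")
```

In this model the unfused sequence needs a second decode cycle for the fifth instruction, while the fused sequence fits in one, which matches the Cycle 1 / Cycle 2 labels on the two slides.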

35

Macro-Fusion (cont)

[Figure: the fused uop "cmpjne eax, mem2, targ" passes from the Scheduler to Execution (Branch Eval), then flags and target go to Write back]

• Lower latency
• Increased bandwidth
• “Virtually” increase storage

Macro-fusion makes the machine behave as if it is wider and deeper, without the additional cost

Enabling Greater Performance & Efficiency

36

Wide Dynamic Execution

• 4 wide rename
• 4 wide micro-op execution
• 4 wide retire

Deeper out of order storage

32 discontiguous micro-ops considered for dispatch per cycle

33% Wider Than Previous Generation

37

Advanced Digital Media Boost

128-bit packed Multiply
plus 128-bit packed Add
plus 128-bit packed Load
plus 128-bit packed Store
plus (how about a CMPJCC)

2x Compute Throughput / Clock

38

Advanced Digital Media Boost

Let's scale a vector: B[i] := A[i] * C

[Animation, slides 38 to 51: arrays A and B stream through an "Existing Processor" and through the Intel® Core™ microarchitecture side by side. Captions, in order of appearance:]

• Assume both microarchitectures have a 128-bit path from L1 to the processor
• ...handles all the memory data
• Multiply can’t keep up with load bandwidth
• Multiplier operates on all data
• Existing implementations eventually stall the load pipe waiting for the multiplier
• Load pipe is free to advance
• ...keeps pipeline free for computations
• ...maintains 2X throughput compared to prior implementations
• 8 Single Precision Flops/cycle
• 4 Double Precision Flops/cycle

Leading Compute Density: 2x Compute Throughput / Clock
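The flops-per-cycle figures quoted in the animation can be reproduced with simple lane arithmetic. This sketch assumes one 128-bit packed add and one 128-bit packed multiply can both issue every cycle (the FAdd and FMul units in the block diagram); that pairing is an interpretation of the slides. The function names are hypothetical, and the Python loop only mimics the grouping of elements, not actual SIMD execution.

```python
VECTOR_BITS = 128   # packed operand width from the slides
FP_UNITS = 2        # ASSUMPTION: one packed FAdd and one packed FMul issue per cycle


def peak_flops_per_cycle(element_bits: int) -> int:
    """Lanes per 128-bit operand times the number of packed FP units."""
    lanes = VECTOR_BITS // element_bits
    return lanes * FP_UNITS


def scale_vector(a, c, lanes=4):
    """B[i] := A[i] * C, handled one packed group of `lanes` elements at a time."""
    b = []
    for start in range(0, len(a), lanes):
        chunk = a[start:start + lanes]     # stands in for one 128-bit packed load
        b.extend(x * c for x in chunk)     # stands in for one packed multiply
    return b


print("single precision:", peak_flops_per_cycle(32), "flops/cycle")
print("double precision:", peak_flops_per_cycle(64), "flops/cycle")
print(scale_vector([1.0, 2.0, 3.0, 4.0, 5.0], 2.0))
```

With 32-bit elements there are four lanes per operand, giving 8 flops per cycle across the two units; with 64-bit elements there are two lanes, giving 4.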

52

Smart Memory Access

• Memory Disambiguation
• Improved Prefetchers

Hiding Latency to Memory Subsystem

53

Smart Memory Access – Goal

[Figure: CORE1 and CORE2, each with its own L1 Data Cache, above a Smart-Shared L2 Cache that connects to the System Bus]

WHEN: Ensure data can be used as early as possible

WHERE: Ensure user of data has it as close as possible


54

Without Memory Disambiguation: Subsequent Loads Must Wait

[Figure: instruction sequence Store1, Load2, Store3, Load4 accessing memory locations that hold Data W, Data X, Data Y, and Data Z, with numbers marking execution order]

Load4 must WAIT until previous stores complete
Waits for Data X before it can execute

55

Smart Memory Access: Solving the Problem of When

With Intel’s New Memory Disambiguation

[Figure: the same sequence Store1, Load2, Store3, Load4 and memory locations Data W, Data X, Data Y, and Data Z, with Load4 now numbered first in execution order]

Loads can decouple from Stores
Load4 can get its data FIRST

56

Smart Memory Access: Memory Disambiguation

Memory Disambiguation predictor
– Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible
– Increasing the performance of OOO memory pipelines

Disambiguated loads checked at retirement
– Extension to existing coherency mechanism
– Invisible to software and system
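The predict-then-verify idea can be illustrated with a toy software model. This is not Intel's mechanism: the address assignment for Store1 through Load4, the dictionary-based predictor, and the `run` function are all assumptions made for illustration. Loads predicted not to alias read memory before older stores have written, and every such load is checked against older store addresses when it retires.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str, int]  # (kind, address, value); value is unused for loads


def run(ops: List[Op], memory: Dict[str, int], predict_no_alias: Dict[int, bool]):
    """Execute `ops` given in program order.

    Returns (load results by op index, indexes of loads that had to replay).
    """
    mem = dict(memory)
    # Phase 1: hoist predicted-safe loads above all older stores.
    hoisted = {
        i for i, (kind, _, _) in enumerate(ops)
        if kind == "load" and predict_no_alias.get(i, False)
    }
    # Phase 2: retire in program order, checking each hoisted load.
    results: Dict[int, int] = {}
    replayed: List[int] = []
    written = set()                      # addresses written by older stores
    for i, (kind, addr, value) in enumerate(ops):
        if kind == "store":
            mem[addr] = value
            written.add(addr)
        else:
            if i in hoisted and addr in written:
                replayed.append(i)            # mispredicted: the load is redone
                predict_no_alias[i] = False   # train the toy predictor
            results[i] = mem[addr]            # architecturally correct value
    return results, replayed


# ASSUMPTION: addresses chosen for illustration (Store1 Y, Load2 Y, Store3 W, Load4 X).
OPS = [("store", "Y", 1), ("load", "Y", 0), ("store", "W", 2), ("load", "X", 0)]
MEMORY = {"W": 0, "X": 7, "Y": 0, "Z": 0}

print(run(OPS, MEMORY, {1: False, 3: True}))   # Load4 hoisted safely
print(run(OPS, MEMORY, {1: True, 3: True}))    # Load2 hoisted wrongly, replayed
```

In both runs the loads return the same values, which is the sense in which the technique is invisible to software: a wrong prediction costs a replay but never changes the result.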


57

Smart Memory Access: Prefetchers

[Animation, slides 57 to 62: Load1 (oldest) through Load4 (youngest) request data through the L1 Data Cache and the Shared L2 Data Cache. Captions, in order of appearance:]

• Memory is too far away
• Caches are closer when they have the data
• Prefetchers detect applications' data reference patterns
• And bring the data closer to the data consumer

Solving the Problem of Where: Minimizing Memory Latency

63

Prefetchers and Multi-Core

• Three Individual Prefetchers per Core
• Two L2 Prefetchers dynamically shared

66

Smart Memory Access

8 Prefetchers per two-core processor
– 2 data and 1 instruction prefetcher per core
– able to handle multiple simultaneous patterns
– 2 prefetchers in the L2 cache
– tracking multiple patterns per core

Prefetchers monitor demand traffic and regulate “aggression”

Implementation “knobs” allow platform and segment specific settings tailored to applications and usage models

Data Is Where You Need It, When You Need It
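To make "detect data reference patterns" concrete, here is a minimal stride-detection sketch. The slides do not describe how the hardware prefetchers work, so this is a generic textbook-style stride detector, not Intel's design; the class name and the `degree` parameter (standing in for the "aggression" knob) are hypothetical.

```python
from typing import List, Optional


class StridePrefetcher:
    """Toy detector: once the same stride repeats, request lines ahead of demand."""

    def __init__(self, degree: int = 2):
        self.degree = degree                      # how far ahead to prefetch
        self.last_addr: Optional[int] = None
        self.last_stride: Optional[int] = None

    def access(self, addr: int) -> List[int]:
        """Record a demand access and return addresses to prefetch (may be empty)."""
        prefetches: List[int] = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * k for k in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


# Prefetcher count from the slide: 2 cores * (2 data + 1 instruction) + 2 in L2.
TOTAL_PREFETCHERS = 2 * (2 + 1) + 2

pf = StridePrefetcher(degree=2)
for address in (0, 64, 128, 192):
    print(address, "->", pf.access(address))
```

The detector stays quiet for the first two accesses, then starts requesting the next two addresses once the 64-byte stride has repeated. Lowering `degree` makes it less aggressive.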

67

Advanced Smart Cache: Multi-core Optimized

All the Smart Cache benefits:
• L2 can adapt to each core’s load
• Fast data sharing
• No replicated data

Plus:
• 2X BW to L1 caches

[Figure: Core1 and Core2 sharing one L2 Cache]

Shared & Multi-Core Optimized, with 2x Bandwidth

68

Advanced Smart Cache: Dynamic Cache Allocation

[Figure: Advanced Smart Cache (Core1 and Core2 sharing one L2 Cache) beside Independent Cache (today) (Core1 and Core2, each with its own L2 Cache)]

Shared Cache adapts to mismatched loads. Independent Cache can thrash a heavy app even when the other cache is under-utilized
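The thrashing claim can be demonstrated with a small LRU simulation. Everything here is an assumption chosen for illustration: the 4-line and 8-line capacities, the working-set sizes, the strictly interleaved access pattern, and the LRU replacement policy (the slides do not state the real policy).

```python
from collections import OrderedDict
from itertools import cycle, islice


class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines: "OrderedDict[tuple, None]" = OrderedDict()
        self.hits = 0

    def access(self, line) -> None:
        if line in self.lines:
            self.hits += 1
            self.lines.move_to_end(line)         # mark as most recently used
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)   # evict least recently used
            self.lines[line] = None


def simulate(shared: bool, accesses_per_core: int = 60) -> int:
    """Total L2 hits for a heavy core (6 lines) and a light core (2 lines)."""
    heavy = islice(cycle(("A", n) for n in range(6)), accesses_per_core)
    light = islice(cycle(("B", n) for n in range(2)), accesses_per_core)
    if shared:
        l2_a = l2_b = LRUCache(8)                # one 8-line cache for both cores
    else:
        l2_a, l2_b = LRUCache(4), LRUCache(4)    # two independent 4-line caches
    for a, b in zip(heavy, light):
        l2_a.access(a)
        l2_b.access(b)
    return l2_a.hits if shared else l2_a.hits + l2_b.hits


print("independent caches hits:", simulate(shared=False))
print("shared cache hits:      ", simulate(shared=True))
```

With the same total capacity, the heavy core's six-line loop never hits in its private four-line cache while half of the light core's cache sits idle. In the shared cache, both working sets fit and only the first touch of each line misses.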

69

Advanced Smart Cache: Efficient Data Sharing

[Figure: Independent Cache (Core1 and Core2, each with its own L2 Cache, connected to Main memory over the FSB) beside Advanced Smart Cache (Core1 and Core2 sharing one L2 Cache, connected to Main memory over the FSB)]

2X L2 to L1 Bandwidth

70

Intelligent Power Capability

Extending the power management architecture
• Intel® Pentium® M processor innovated a new power management architecture
• Intel® Core™ Duo extended the Pentium® M processor capability to multi-core

New Power Features within each processor core
• Ultra fine-grained power control
• Split Busses
• Platformization of Power Management Architecture

Enhancing Energy Efficiency

71

Intelligent Power Capability: Ultra Fine Grained Power Control

Even during periods of high performance execution, many parts of the chip core can be shut off.

An example could be a SW memory initialization executing from the front end with the IQ operating as a loop cache.

[Block diagram repeated, with three units labeled FP]

72

Intelligent Power Capability: Split Busses (core power feature)

Many buses are sized for worst-case data
• x86 instruction of 15 bytes
• ALU can write back 128 bits

Improved Energy Efficiency

73

Intelligent Power Capability: Split Busses (core power feature)

By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while maintaining Cdynamic closer to that of thinner buses

Improved Energy Efficiency
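A rough way to see the effect on Cdynamic is to count how many bus bits are activated per transfer. The model below is a deliberate simplification under stated assumptions: a 128-bit bus split into two 64-bit segments, an unsplit bus that activates all of its bits on every transfer, and a made-up mix of data widths. None of these details come from the slides.

```python
BUS_BITS = 128      # worst-case width, e.g. a 128-bit ALU write-back
SEGMENT_BITS = 64   # ASSUMPTION: the bus is split into two 64-bit segments


def driven_bits(data_bits: int, split: bool) -> int:
    """Bus bits activated to move one value that is `data_bits` wide."""
    if not 0 < data_bits <= BUS_BITS:
        raise ValueError("data width must fit on the bus")
    if not split:
        return BUS_BITS                        # modeled as switching the whole bus
    segments = -(-data_bits // SEGMENT_BITS)   # ceiling: segments actually needed
    return segments * SEGMENT_BITS


# Hypothetical sequence of write-back widths, mostly narrower than the worst case.
widths = [32, 32, 64, 128, 32, 64]
full = sum(driven_bits(w, split=False) for w in widths)
halved = sum(driven_bits(w, split=True) for w in widths)

print("unsplit bus bits driven:", full)
print("split bus bits driven:  ", halved)
```

The split bus still carries the occasional 128-bit value in one transfer, so the bandwidth benefit of the wide bus remains, while the activity for this sample mix drops from 768 to 448 bits.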

74

Platformization of Power Management Architecture

Integrating best features from Server and Mobile products

Exposing more to the system
• PSI-2: Power Status Indicator (Mobile)
• DTS: Digital Thermal Sensors
• PECI: Platform Environment Control Interface

75

Power Status Indicator (Mobile)

Processor communicates power consumption to external platform components
– Optimization of voltage regulator efficiency
– Load line and power delivery efficiency

[Figure: processor and voltage regulator (VR) linked by PSI-2 / VID]

76

DTS – Digital Thermal Sensor

Several thermal sensors are located within the Processor to cover all possible hot spots

Dedicated logic scans the thermal sensors and measures the maximum temperature on the die at any given time

Accurately reporting Processor temperature enables advanced thermal control schemes

[Figure: Core 1 DTS Logic, Core 2 DTS Logic, two LPF blocks, and DTS control and status]

Enabling Efficient Processor and Platform Thermal Control…

77

Platform Environment Control Interface (PECI)

Processor provides its temperature reading over a multi-drop single-wire bus, allowing efficient platform thermal control

[Figure: PROC #1, PROC #2, and PROC #3, the PECI bus, a Manager block, and the Processor Fan, Auxiliary Fan, Chassis Fan 1, and Chassis Fan 2]

78

Microarchitecture Feature Summary

New, State-of-the-Art Microarchitecture

• Wide Dynamic Execution: 33% wider pipes (4 vs. 3) and greater efficiency
• Advanced Digital Media Boost: 2x compute throughput / clock
• Smart Memory Access: Minimizing latency – Data Where & When needed
• Advanced Smart Cache: Multi-Core optimized, shared with 2x bandwidth
• Intelligent Power Capability: Improved energy efficient performance

79

Agenda

– Intel® Core™ Microarchitecture Design Goals and Roadmap
– Summary

80

Intel® Core™ Microarchitecture and Software

Software consistency across application space
– Wide Dynamic Execution will provide generic performance gains
– Smart Memory Access targets memory intensive apps
– Advanced Digital Media Boost provides a leap in capability for media and floating point apps
– Multi-Core and Advanced Smart Cache further improve the growing number of multi-threaded applications

Software consistency across market segments
– New apps and optimizations can target a single microarchitecture

Immediate Performance Increase Across Applications and Segments

81

Agenda

• Multi-core Update and New Micro-architecture level set


• Summary

82

Summary

Continuing to drive aggressive multi-core ramp
– Dual-core ramp in 2006, quad-core starts in early 2007

Intel® Core™ microarchitecture delivers leading performance and performance/watt
– Conroe: >40% performance increase [1] / >40% less power
– Woodcrest: >80% performance increase [1] / >35% less power
– Mobile: Extending leadership delivered with Intel® Core™ Duo with >20% performance increase [1]

On track for product introductions starting in Q3’06
– Based upon new Intel® Core™ microarchitecture

[1] Estimated SPECint* rate
