ECE 371 Microprocessors

Chapter 6: Intel® x86 Microprocessor Architecture

Derived from Dr. Herbert G. Mayer's 2003 presentation to Intel's Software College

Status 8/30/2015. For use at CCUT, Fall 2015.

* Other brands and names may be claimed as the property of others.

Agenda

Assumptions
Speed Limitations
x86 Architecture Progression
Architecture Enhancements
Intel® x86 Architectures

Assumptions

Audience: understands general x86 architecture
Knows some assembly language
– Flavor used here: GNU assembler, gas
– Result on right-hand side:
– mov [temp], %eax; is a load into register a
– add %eax, %ebx; new integer sum is in register b
– Different from Microsoft* masm, and tasm
Understands some architectural concepts:
– Caches, multi-level caches, (some MESI)
– Threading, multi-threaded code
– Blocking (cache), blocking (aka tiling), blocking (thread synch.)
Causes of pipeline stalls
– Control flow change
– Data dependence, registers and data
NOT discussed: VTune, CISC vs. RISC

Speed Limitations

Agenda

Performance Limiters
Register Starvation
Processor-Memory Gap
Processor Stalls
Store Forwarding
Misc Limitations:
– Spin-Lock in Multi Thread
– Misaligned Data
– Denorm Floats

Performance Limiters

Architectural limitations the programmer or compiler can overcome:
– Indirect limitations: stall via branch, call, return
– Incidental limits: resource constraint
– Historical limits: register-starved x86
– Technological: ALU speed vs. memory access speed
– Logical limits: data and resource dependence

Register Starvation

How many regs needed (compiler or programmer)?
– Infinite is perfect
– 1024 is very good
– 64 acceptable
– 16 is crummy
– 4+4 is x86
– 1 is saa (single-accumulator architecture)
Formally on x86: 16 regs. Quick test:
– ax, bx, cx, dx
– si, di
– bp, sp, ip
– cs, ds, ss, es, fs, gs, flags
Of which ax, bx, cx, dx are GPRs, almost
Rest can be used as better temps
ax & dx used for * and /, cx for loop

Register Starvation

Absence of regs causes
– Spurious memory spills and loads
– False data dependences (not "dependencies")
Except single-accumulator arch: no other arch is more register starved than x86

Instruction stream (added ops, mem latency):
mov %eax, [mem1]
use stuff, %eax
mov [mem1], %eax

Instruction stream (false DD):
mov %eax, [tmp]
add %ebx, %eax
imul %ecx
mov %eax, [prod]
mov [tmp], %eax

And the Programmer?

No solution in ISA: x86 has had 4 GPRs since the 8086
Improved via internal register renaming
– Pentium® Pro has hundreds of internal regs
Added registers in MMX
– Visible to you, programmer and compiler
– fp(0) .. fp(7), 80 bits as FP, 64 bits as MMX, but note: context switch
Added registers in SSE
– xmm(0) .. xmm(7), 128 bits
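
Register renaming can be illustrated with a small model. The sketch below is not how any Intel processor implements it; it only assumes that every register write gets a fresh physical register, and the names `rename_registers` and `p0`, `p1`, ... are hypothetical.

```python
def rename_registers(instructions):
    """Give every register write a fresh physical register.

    instructions: list of (dest, sources) tuples in program order.
    Returns the same stream expressed in physical register names.
    """
    latest = {}    # architectural name -> most recent physical name
    renamed = []
    for index, (dest, sources) in enumerate(instructions):
        # A read sees the latest physical copy, or the original name
        # if nothing in this window has written that register yet.
        new_sources = tuple(latest.get(s, s) for s in sources)
        physical = f"p{index}"   # hypothetical naming scheme
        latest[dest] = physical
        renamed.append((physical, new_sources))
    return renamed


# eax is reused as a temp: the third instruction overwrites eax while the
# second one still needs the old value (a false dependence).
stream = [
    ("eax", ("tmp",)),         # load tmp into eax
    ("ebx", ("eax", "ebx")),   # ebx = eax + ebx
    ("eax", ("prod",)),        # reuse eax for an unrelated load
]
renamed_stream = rename_registers(stream)
```

After renaming, the first and third instructions write different physical registers, so the third no longer has to wait for the second to read the old eax.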

Processor-Memory Gap

[Figure: performance (log scale, 1 to 1000) vs. time (1980 to 2002). CPU performance follows "Moore's Law" at about 60%/yr.; DRAM performance improves about 7%/yr.; the processor-memory performance gap grows about 50%/yr. Source: David Patterson, UC Berkeley]
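
The 50% per year figure for the gap follows from the two growth rates in the chart. A minimal sketch of the arithmetic, assuming both curves start at the same performance level and compound yearly:

```python
CPU_GROWTH = 0.60    # processor performance growth per year, from the figure
DRAM_GROWTH = 0.07   # DRAM performance growth per year, from the figure


def gap_after(years):
    """Ratio of CPU to DRAM performance after `years`, both starting at 1."""
    return ((1 + CPU_GROWTH) / (1 + DRAM_GROWTH)) ** years


yearly_gap_growth = gap_after(1) - 1   # roughly 0.495, i.e. about 50% per year
gap_in_10_years = gap_after(10)        # compounding makes this a large factor
```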

Bridging the Gap: Trend

[Figure: performance vs. time for CPU and DRAM. The gap is bridged by caches and multilevel caches, by instruction-level parallelism (Intel® Pentium® II processor: out-of-order execution, ~30%) and by thread-level parallelism (Intel® Xeon™ processor: Hyperthreading Technology, ~30%)]

Hyperthreading Technology: feeds two threads to exploit shared execution units

Impact of Memory Latency

Memory speed has NOT kept up with advance in processor speed
– Avg. integer add ~0.16 ns (Xeon), but memory accesses take ~10 ns or more
CPU hardware resource utilization is only 35% on average
– Limited due to memory stalls and dependencies
Possible solutions to memory speed mismatch?
Memory speed mismatch is a major source of CPU stalls
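
To get a feel for the mismatch, the two latencies on this slide can be put side by side. The sketch below is a deliberately crude model: it assumes adds and memory accesses do not overlap at all, which real out-of-order hardware partly avoids.

```python
ADD_NS = 0.16       # average integer add on Xeon, from the slide
MEMORY_NS = 10.0    # lower bound for one memory access, from the slide

# How many adds fit into the time of a single memory access.
adds_per_access = MEMORY_NS / ADD_NS


def run_time_ns(adds, memory_accesses):
    """Serial run time: no overlap between compute and memory (assumption)."""
    return adds * ADD_NS + memory_accesses * MEMORY_NS


def compute_fraction(adds, memory_accesses):
    """Share of the run time spent doing adds rather than waiting."""
    return adds * ADD_NS / run_time_ns(adds, memory_accesses)


# One memory access per 100 adds already costs well over a third of the time.
fraction_busy = compute_fraction(100, 1)
```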

And the Programmer?

Cache provided
Methods to manipulate cache
Tools provided to pre-fetch data
– At risk of superfluous fetch, if control flow changes

Processor Stalls

Stalled cycle is a cycle in which processor cannot receive or schedule new instructions
– Total Cycles = Total Stall Cycles + Productive Cycles
– Stalls waste processor cycles
– Perfmon, Linux ps, top, other system tools show stalled cycles as busy CPU cycles
– Intel® VTune Analyzer used to monitor stalls (HP* PFmon)

[Figure: cycles split into unstalled and stalled]
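
The cycle identity above is easy to turn into numbers. The sketch below assumes, as a simplification, that the 35% utilization figure quoted earlier can be read as the share of productive cycles; the cycle counts are made up for illustration.

```python
def stall_cycles(total_cycles, productive_cycles):
    """Total Cycles = Total Stall Cycles + Productive Cycles, solved for stalls."""
    return total_cycles - productive_cycles


def stalled_fraction(total_cycles, productive_cycles):
    return stall_cycles(total_cycles, productive_cycles) / total_cycles


# Hypothetical run: 1,000,000 cycles, of which 35% are productive.
total = 1_000_000
productive = 350_000
wasted = stall_cycles(total, productive)
wasted_share = stalled_fraction(total, productive)
```

A system tool that reports this CPU as 100% busy would be counting all 1,000,000 cycles, including the stalled ones.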

Why Stalls Occur

Stalls occur because:
– Instruction needs resource not available
– Dependences [sic] (control or data) between instructions
– Processor / instruction waits for some signal or event
Sample resource limitations:
– Registers
– Execution ports
– Execution units
– Load / store ports
– Internal buffers (ROBs, WOBs, etc.)
Sample events:
– Exceptions, cache misses, TLB misses, etc.
– Common thing: they hold up compute progress

Control Dependences (CD)

Change in flow of control causes stalls
Processors handle control dependences:
– Via branch prediction hardware
– Conditional move to avoid branch & pipeline stall

Instruction stream with barrier (predict):
mov [%ebp+8], %eax
cmp 1, %eax
jg bigger
mov 1, %eax
. . .
bigger:

Instruction stream with barrier (predict):
dec %ecx
push %eax
call rfact
mov %ecx, [%ebp+8]
mul %ecx

Data Dependences (DD)

Data dependence limits performance
Programmer / compiler cannot solve
– Xeon has register renaming to avoid false data dependences
– Supports out-of-order execution to hide effects of dependences

Instruction stream (mem latency):
. . .
mov eax, [ebp+8]
cmp eax, 1

Instruction stream (false DD):
mov [temp], eax
add eax, ebx
mul ecx
mov [prod], eax
mov eax, [temp]
. . .
bigger:

Xeon Processor Stalls

D-side
– DTLB misses
– Memory hierarchy: L1, L2 and L3 misses
Core
– Store buffer stalls
– Load/store splits
– Store forwarding hazard
– Loading partial/misaligned data
– Branch mispredicts
I-side
– Streaming buffer misses
– ITLB misses
– TC misses
– 64K aliasing conflicts
Misc
– Machine clears

And the Programmer?

Reduce processor stalls by prefetching data
Reduce control flow changes by conditional move
Reduce false dependences by using register temps, from mmx (fp) and xmm pool

Partial Writes: WC Buffers

[Figure: fill/WC buffers sit between the first-level cache and the second-level cache, memory and the front-side bus (FSB). An incomplete WC buffer (8B 8B 8B -) drains as three 8-byte "partial" bus transactions; a complete WC buffer (8B 8B 8B 8B) drains as 1 bus transaction.]

Detection (VTune), event-based sampling:
– Ext. Bus Partial Write Trans.
– L2 Cache Request
– Ext. Bus Burst Read Trans.
– Ext. Bus RFO Trans.
Causes:
1) Too many WC streams
2) WB loads/stores contending for fill buffers to access L2 cache or memory
Partial writes reduce actual front-side bus bandwidth
– ~3x lower for PIII
– ~7x lower for Pentium 4 processor, due to longer cache line
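
The ~3x and ~7x figures can be reproduced with a simple count of bus transactions. This is a sketch under two assumptions that are not stated on the slide: the Pentium III WC buffer covers a 32-byte line, the Pentium 4 buffer a 64-byte line, and bus cost is counted per transaction regardless of size.

```python
CHUNK_BYTES = 8   # size of one partial bus write, from the figure


def bus_transactions(filled_chunks, line_bytes):
    """Transactions needed to drain one WC buffer.

    A completely filled buffer goes out as a single burst; an incomplete
    one goes out as one partial write per filled 8-byte chunk.
    """
    chunks_per_line = line_bytes // CHUNK_BYTES
    if not 0 < filled_chunks <= chunks_per_line:
        raise ValueError("filled_chunks must be between 1 and chunks per line")
    return 1 if filled_chunks == chunks_per_line else filled_chunks


# Worst case: all but one chunk filled when the buffer is evicted.
worst_p3 = bus_transactions(3, 32)   # assumed 32-byte line
worst_p4 = bus_transactions(7, 64)   # assumed 64-byte line
```

The longer line makes the worst case more expensive, which is the point of the last bullet above.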

Store Forwarding Guidelines

[Figure: store/load pairs drawn against 16-byte boundaries and sorted into "will forward" and "forwarding penalty" cases. Case A: load aligned with store; load contained in store; 128-bit forwards must be 16-byte aligned. Case B: load contained in single store.]

Store forward: loading from an address recently stored can cause data to be fetched more quickly than via mem access.
Large penalty for non-forwarding cases (1.1-1.3x)
MSVC < 7.0 and you generate these. Intel compiler doesn't.

Pick right compiler, for HLL programs
Use VTune to check, for asm code
In asm programs, ensure loads after stores are:
– Contained in stored data, subset or proper subset
– In single previous store, not in sum of multiple stores
– Thus do store-combining: assemble together, then store
– Both data start on same address
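
The rules of thumb above can be written down as a small checker. This only encodes the guidelines as stated on the slide (same start address, load contained in one store); the exact hardware conditions differ between processor generations, and the function name `will_forward` is hypothetical.

```python
def will_forward(store_addr, store_size, load_addr, load_size):
    """True if a load can be satisfied from one earlier store, per the slide.

    The load must start at the store's address and must not be larger
    than the stored data.
    """
    same_start = load_addr == store_addr
    contained = load_size <= store_size
    return same_start and contained


def forwards_from_any(stores, load_addr, load_size):
    """A load forwards only from a single store, never from several combined."""
    return any(will_forward(addr, size, load_addr, load_size)
               for addr, size in stores)


narrow_load = will_forward(0x100, 4, 0x100, 2)    # subset, same start
offset_load = will_forward(0x100, 4, 0x102, 2)    # contained, wrong start
# Two 2-byte stores followed by one 4-byte load spanning both of them.
split_stores = forwards_from_any([(0x100, 2), (0x102, 2)], 0x100, 4)
```

The last case is why the slide recommends store-combining: assemble the 4 bytes in a register first, then issue one store.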

Misc Limitations

Spin-Lock in Multi Thread
– Don't use busy wait just because you have (almost) a second processor for second thread
Misaligned data
– Don't align data on arbitrary boundary just because architecture can fetch from any address
Dumb errors
– Failure to use proper tool (library, compiler, performance analyzer)
– Failure to use tiling (aka blocking) or SW pipelining
Denormalized floats

Use pause, when applicable!
– New NetBurst instruction
Use compiler switches to align data on address divisible by greatest individual data object
– Who cares about wasting 7 bytes to force 8-byte alignment?
Be smart, pick right tools
– Instruct compiler to SW pipeline
– In asm, manually SW pipeline; note easier on EPIC than VLIW, lacking prologue, epilogue sometimes
– Enable compiler to partition larger data structures into smaller suitable blocks, for improved locality
– Cache parameter dependent
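
The "7 bytes" in the alignment bullet is the worst-case padding for an 8-byte boundary. A minimal sketch of the address arithmetic, using made-up addresses and the 64-byte L1 line size quoted elsewhere in these slides:

```python
def padding_for(address, alignment):
    """Bytes to skip so that `address` becomes a multiple of `alignment`."""
    return (-address) % alignment


def aligned_up(address, alignment):
    return address + padding_for(address, alignment)


def crosses_boundary(address, size, boundary):
    """True if an object of `size` bytes at `address` straddles a boundary."""
    return address // boundary != (address + size - 1) // boundary


worst_padding = padding_for(0x1001, 8)        # one byte past a boundary
split_load = crosses_boundary(0x103C, 8, 64)  # 8-byte object near a line end
clean_load = crosses_boundary(0x1040, 8, 64)  # same object, 8-byte aligned
```

An object aligned to its own size never straddles a cache line, which is what the padding buys.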

Exercise for the first of 2 labs, this one being a "two-minute" exercise:
Turn on your computer, verify Linux is alive
Verify you have available:
– Editor to modify program
– Intel C++ compiler, text command icc, with -g
– Debugger ddd, with disassembly ability
Source program vscal.cpp
Linux commands: ls, vi, icc, mkdir, etc.

Module Summary

Covered: key causes that render execution slower than possible:
More registers at your disposal than seems
Von Neumann bottleneck can be softened via cache use and data pre-fetch
Stalls can be reduced by conditional move, avoiding false dependences
Use (time limited) capabilities, such as proper store forwarding
Note new pause instruction

x86 Architecture Progression

Agenda: x86 Arch. Progression

Abstract & Objectives
x86 Nomenclature & Notation
Intel® Architecture Progress
Pentium 4 Abstract

Abstract & Objectives: x86 Architecture Progression

Abstract: High-level introduction to history and evolution of increasingly powerful 16-bit and 32-bit x86 processors that are backwards compatible.
Objectives: understand processor generations and architectural features, by learning
– Progressive architectural capabilities
– Names of corresponding Intel processors
– Explanation, description of capabilities
– FP incompatibility, minor

Non-Objectives

Objective is not introduction of:
– x86 assembly language, assumed known
– Itanium® processor family, now in 3rd generation
– Intel tools (C++, VTune)
– Performance tools: MS Perfmon, Linux ps, emon, HP PFMon, etc.
– Performance benchmarks, performance counters
– Differentiation Intel vs. competitor products
– CISC vs. RISC

x86 Nomenclature & Notation

Pentium® II, 2H98, 450 MHz
MMX, BX chipset
Dynamic branch prediction enhanced

Line 1: processor name, initial launch date, final clock speed
Line 2: architecturally visible enhancement list, can be empty
Line 3: architectural speedup technique, invisible except for higher speed

Intel® Architecture Progress

– 8086, 2H80, 4 MHz: 8087
– 80486, 2H85, 10 MHz: FP integrated
– Pentium®, 1988, 40 MHz: D+I caches, static branch prediction
– Pentium® Pro, 2H95, 100 MHz: dynamic branch prediction
– Pentium® II, 2H98, 450 MHz: MMX, BX chipset; dynamic branch prediction enhanced
– Pentium® III, 2H99, 733 MHz: SSE, XMM regs; large cache, L2 on chip
– Pentium® 4, 2H00, 3.06 GHz: SSE2, 144 WNI, NetBurst®; L3 on-chip cache

Intel® Pentium® 4 Processors

Processor | Family | Description
Northwood | Pentium® | Willamette shrink. Consumer and business desktop processor. HT not enabled, though capable.
NW E-Step | Pentium | HT errata corrected. Desktop processor.
Prescott | Pentium | Consumer and business desktop processor. Replaces NW. Offers 6 PNI: Prescott New Instructions. First processor with LaGrande technology (trusted computing).
Prestonia DP | Xeon™ | DP slated for workstations and entry-level servers. Based on NW core. HT enabled. 512 kB L2 cache. No L3. 3 GHz processor.
Nocona DP | Xeon | DP based on Prescott core. Targeted for 3.06 GHz. 533 MHz (quad-pumped) bus, i.e. bus speed is 133 MHz. 1 MB L2 cache. HT enabled. About to be launched.
Foster MP | Xeon | MP based on Willamette core. 1 MB L3 cache, 256 kB L2, HT enabled. For higher-end servers.
Gallatin MP | Xeon | MP based on NW core. 1 or 2 MB L3 cache, 512 kB L2 cache. For high-end servers. See 8-way HP DL 760, and IBM x440. HT enabled.
Potomac MP | Xeon | MP based on Prescott core. 533 MHz (quad-pumped) bus. 1 MB L2 cache, 8 MB L3 cache. HT enabled, yet to be launched.

Note: lower clock rates for MP versions, due to higher circuit complexity and bus load.

Processor Generation Comparison

Feature | Pentium® III Processor (.25 micron) | Pentium® III Processor (.18 micron) | Pentium® 4 Processor | Northwood
MHz | 450-600 MHz | 600 MHz to 1.13 GHz | 1.5 GHz | 2+ GHz
Execution Type | Dynamic | Dynamic | Intel® NetBurst™ Arch | Intel® NetBurst™ Arch
System Bus | 100 MHz | 133 MHz | 400 MHz (4x100 MHz) | 400/533 MHz (4x100/133 MHz)
MMX™ Technology | Yes | Yes | Yes | Yes
Streaming SIMD Extensions | Yes | Yes | Yes | Yes
Streaming SIMD Extensions 2 | No | No | Yes | Yes
L2 Cache | 512k off-die | 256k on-die | 256k on-die | 512k on-die
Manufacturing Process | .25 micron | .18 micron | .18 micron | .13 micron
Chipset | ICH-1 | ICH-2 | ICH-2 | ICH-2


8087 co-processor of 8086: off-chip FP computation, extended 80-bit FP format for DP
MMX: multi-media extensions
– MMX regs aliased with FP register stack
– Needs context switch
– FP regs also called ST(i) regs
SSE: Streaming SIMD Extensions, available since Pentium III
WNI: 144 new instructions, using additional data types for existing opcodes, using previously reserved opcodes


XMM: 8 new 128-bit registers, in addition to MMX
SSE2: multiple integer ops and multiple DP FP ops: part of 144 WNI
– Regs unchanged in Pentium® 4 from P III
– Ops added
NetBurst: generic term for: HyperThreading & quad-pumped bus & new Trace Cache & etc.

Note: an architectural feature ages with the next generation, but survives, due to the compatibility requirement. Hence it is interesting not only for historical reasons: you need to know it!

Xeon™ MP Abstract

Xeon™ MP Processor "Gallatin"
– Physical addressing (36-bit, P Pro): 64 GB (PAE-36)
– On-die cache: L3 1 or 2 MB; L2 512 KB; L1 12K TC, 8K D
– Pipeline stages: 20
– Registers: 24 (126)
– Execution units: 8 integer, 1 multimedia, 2 floating point
– Core frequency: 2.0+ GHz
– Issue ports: 1 2 3 4
– Logical CPU 2x: Hyperthreading Technology
– System bus bandwidth: 3.2 GB/s (400)
– Instructions/clock cycle: 3
Also labeled in the diagram: external cache, 2x ALU
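
Two of the figures on this slide can be derived rather than memorized. The sketch below assumes a 64-bit (8-byte) wide data bus, which the slide does not state, and reads "(400)" as 400 million transfers per second from the quad-pumped 100 MHz bus.

```python
PHYSICAL_ADDRESS_BITS = 36           # PAE-36, from the slide
BUS_TRANSFERS_PER_SECOND = 400e6     # 4 x 100 MHz quad-pumped bus
BUS_WIDTH_BYTES = 8                  # assumption: 64-bit data bus

# 2^36 bytes of physical address space, expressed in units of 2^30 bytes.
addressable_bytes = 2 ** PHYSICAL_ADDRESS_BITS
addressable_gb = addressable_bytes // 2 ** 30

# Peak bus bandwidth in bytes per second.
bus_bytes_per_second = BUS_TRANSFERS_PER_SECOND * BUS_WIDTH_BYTES
```

Under these assumptions the results match the slide: 64 GB of physical memory and 3.2 GB/s on the system bus.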

Xeon™ Memory Hierarchy

Xeon™ Processor MP
– L1 (DL0): 8 KB, 64B lines, 2 CLKS
– TC: 12 KB, 64B lines, 2 CLKS
– L2 (unified): 512 KB, 8-way, 128B lines, 7+ CLKS
– L3: 2 MB, 8-way, 128B lines, 21+ CLKS
– External memory: 64 GB, 3.2 GB/s
– Diagram also shows a 12.8 GB/s link
Note: Physical Address Extension, 36-bit PAE addresses, since Pentium® Pro
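
The latencies above can be combined into an average access time once miss rates are known. The sketch below treats each level's clock count as an added penalty on a miss in the level before it, which is one possible reading of the slide; the miss rates and the 200-clock memory latency in the example are invented for illustration.

```python
L1_CLKS = 2    # from the slide
L2_CLKS = 7    # from the slide, lower bound ("7+")
L3_CLKS = 21   # from the slide, lower bound ("21+")


def average_access_clks(l1_miss, l2_miss, l3_miss, memory_clks):
    """Average clocks per data access for given per-level miss rates."""
    beyond_l3 = l3_miss * memory_clks
    beyond_l2 = l2_miss * (L3_CLKS + beyond_l3)
    beyond_l1 = l1_miss * (L2_CLKS + beyond_l2)
    return L1_CLKS + beyond_l1


# Hypothetical workload: 10% L1 misses, 20% of those miss L2,
# half of those miss L3, and memory costs 200 clocks.
typical = average_access_clks(0.10, 0.20, 0.50, 200)
all_hits = average_access_clks(0.0, 0.0, 0.0, 200)
```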

Architecture Enhancements

Agenda: Architecture Enhancements

Abstract & Objectives
Faster Clock
Caches: Advantage, Cost, Limitation
Multi-Level Cache-Coherence in MP
Register Renaming
Speculative, Out of Order Execution
Branch Prediction, Code Straightening

Abstract & Objectives: Architecture Enhancements

Abstract: Outline generic techniques that overcome performance limitations
Objectives: understand cost of architectural techniques (tricks) in terms of resources (mil space) and of lost performance if incorrectly guessed
– Caches: cost silicon, can slow down
– Branch prediction: costs silicon, can be wrong
– Prefetch: costs instruction, may be superfluous
– Superscalar: may not find a second op


Objective is not to explain detail of Intel processor architecture
Not to claim Intel invented techniques; academia invented many
Not to show all techniques; some apply mainly to EPIC or VLIW architectures
No hype, no judgment, just the facts please!

Faster Clock

CISC:
– Decompose circuitry into multiple simple, sequential modules
Resulting modules are smaller and thus can be fast:
– High clock rate
– Shorter speed paths
That's what we call: pipelined architecture
More modules -> simpler modules -> faster clock -> super-pipelined
Super-pipelining NOT goodness per se:
– Saves no silicon
– Execution time per instruction does not improve
– May get worse, due to delay cycles
But:
– Instructions retired per unit time improves
– Especially in absence of (large number of) control-flow stalls
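
The trade-off in these bullets shows up in a toy timing model. Everything numeric below is hypothetical: a fixed 10 ns of logic per instruction, 0.1 ns of latch overhead added per stage, and a control-flow stall that costs a full pipeline refill.

```python
LOGIC_DELAY_NS = 10.0     # hypothetical total logic delay of one instruction
LATCH_OVERHEAD_NS = 0.1   # hypothetical overhead added by each stage


def cycle_ns(stages):
    """Clock period when the logic is cut into `stages` equal pieces."""
    return LOGIC_DELAY_NS / stages + LATCH_OVERHEAD_NS


def latency_ns(stages):
    """Time for one instruction to pass through the whole pipe."""
    return stages * cycle_ns(stages)


def run_ns(instructions, stages, flushes=0):
    """Time to retire a stream; each flush refills the pipe (simplification)."""
    cycles = stages + instructions - 1 + flushes * (stages - 1)
    return cycles * cycle_ns(stages)


per_instruction_10 = latency_ns(10)
per_instruction_20 = latency_ns(20)
straight_line_10 = run_ns(1000, 10)
straight_line_20 = run_ns(1000, 20)
branchy_10 = run_ns(1000, 10, flushes=400)
branchy_20 = run_ns(1000, 20, flushes=400)
```

In this model the 20-stage pipe has the worse per-instruction latency, wins clearly on straight-line code, and loses its advantage once flushes become frequent.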

Faster Clock

Xeon™ processor pipeline has 20 stages
Beautiful model breaks upon control transfer

Intel® NetBurst™ µarchitecture: 20-stage pipeline
Stages 1 to 20: TC Nxt IP (2), TC Fetch (2), Drive, Alloc, Rename (2), Que, Sch (3), Disp (2), RF (2), Ex, Flgs, Br Ck, Drive

[Figure: two instructions overlapped in a simple pipeline, each passing through I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store]

Intel® x86 Architectures

Agenda: Intel x86 Architectures

Abstract & Objectives
High Speed, Long Pipe
Multiprocessing
MMX Operations
SSE Operations
SSE2 Operations
Willamette New Instructions WNI
Cacheability Instructions
Pause Instruction
NetBurst, Hyperthreading
SW Tools

Abstract & Objectives: Intel® x86 Architectures

Abstract: Emphasizing Pentium® 4 processors, show progressively more powerful architectural features introduced in Intel processors. Refer to speed problems solved from module 2 and general solutions explained in module 3.
Objective: you not only understand the various processor product names and supported features (Intel marketing names), but understand how they work, and what their limitations and costs are.


Objective is not to show Intel's techniques are the only ones, or best possible. They are just a good trade-off in light of conflicting constraints:
– Clock speed vs. small # of pipes
– Small transistor count vs. high performance
– Large caches vs. small mil space
– Grandiose architecture vs. backward compatibility
– Need for large register file vs. register-starved x86
– Wish to have two full on-die processors vs. preserving silicon space

High Speed, Long NetBurst™ Pipe

Basic Pentium® Pro pipeline, stages 1 to 10:
Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Exec

Basic NetBurst™ micro-architecture pipeline, stages 1 to 20:
TC Nxt IP (2), TC Fetch (2), Drive, Alloc, Rename (2), Que, Sch (3), Disp (2), RF (2), Ex, Flgs, Br Ck, Drive

Clock labels in the diagram: intro at 733 MHz (.18 µ); 1.4 GHz (.18 µ); 2.2 GHz (.13 µ)

Hyper pipelined Technology enables industry leading performance and clock rate

Check Your Progress

Match pipe functions to clocks/stages (1 to 20):
– Execute: execute the ops on the correct port; 1 clk
– Flags: compute flags (0, negative, etc.); 1 clk
– Trace Cache Fetch: read decoded op from TC; 2 clks
– Register File: read the register file; 2 clks
– Drive: drive ops to the Allocator; 1 clk
– Trace Cache/Next IP: read from Branch Target Buffer; 2 clks
– Dispatch: send ops to appropriate execution unit; 2 clks
– Rename: rename logical regs to physical regs; 2 clks
– Drive: drive the branch result to BTB at front; 1 clk
– Allocate: allocate resources for execution; 1 clk
– Branch Check: compare actual branch to predicted; 1 clk
– Queue: write op into op queue to wait for scheduling; 1 clk
– Schedule: write to schedulers; compute dependencies; 3 clks

®®
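A quick consistency check on the clock counts above: if the functions are placed in the order of the NetBurst pipeline diagram, their clocks should add up to the 20 stages. The sketch below assumes that order; the names `pipe_stage`, `netburst_pipe`, `total_clocks`, and `first_stage` are illustrative only and come from no Intel source.

```c
#include <stddef.h>

/* One entry per pipeline function, in assumed pipeline order.
   Clock counts are the ones quoted on the slide. */
struct pipe_stage {
    const char *name;
    int clocks;
};

static const struct pipe_stage netburst_pipe[] = {
    {"TC Nxt IP", 2}, {"TC Fetch", 2}, {"Drive", 1}, {"Alloc", 1},
    {"Rename", 2}, {"Queue", 1}, {"Schedule", 3}, {"Dispatch", 2},
    {"Register File", 2}, {"Execute", 1}, {"Flags", 1},
    {"Branch Check", 1}, {"Drive", 1},
};

/* Sum of clocks over all functions; should equal the pipeline depth. */
static int total_clocks(void) {
    int sum = 0;
    for (size_t i = 0; i < sizeof netburst_pipe / sizeof netburst_pipe[0]; i++)
        sum += netburst_pipe[i].clocks;
    return sum;
}

/* First stage number (1-based) occupied by table entry `index`. */
static int first_stage(size_t index) {
    int stage = 1;
    for (size_t i = 0; i < index; i++)
        stage += netburst_pipe[i].clocks;
    return stage;
}
```

With this table, `total_clocks()` gives the pipeline depth and `first_stage(6)` gives the stage at which scheduling begins.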


51

Multiprocessing, SMP

Def: Execution of 1 task by >= 2 processors
Flynn Model (1960s):
– Single-Instruction, Single-Data Stream (SISD) Architecture (PDP-11)
– Single-Instruction, Multiple-Data Stream (SIMD) Architecture (Array Processors, Solomon, Illiac IV, BSP, TMC)
– Multiple-Instruction, Single-Data Stream (MISD) Architecture (possibly: pipelined, VLIW, EPIC)
– Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture (possibly: EPIC when SW-pipelined, true multiprocessor)


52

MP Scalability Caveat

Performance gain from doubling processors

Number of processors:  2     4     8     16    32    64    128
Gain:                  0.90  0.81  0.73  0.59  0.43  0.25  0.11

Gain Follows Law of Diminishing Returns


53

Intel® Xeon™ Processor Scaling 1.39x Frequency

[Chart: relative performance on OLTP and SPECint_rate_base2000 for three configurations: (2P) 2.2GHz, 400MHz Bus, 512KB cache; (2P) 3.06GHz, 533MHz Bus, 512KB cache; (2P) 3.06GHz, 533MHz Bus, 1MB cache. Plotted values: 1.00, 1.25, 1.40 and 1.00, 1.31, 1.68]

Frequency Scale more visible with large cache

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

Source: Intel Corporation. Based on Intel internal projections. System configuration assumptions: 1) Two Intel® Xeon™ processor 2.8GHz with 512KB L2 cache in an E7500 chipset-based server platform, 16GB memory, Hyperthreading enabled; 2) Four Intel® Xeon™ processor MP 1.6GHz with 1MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 3) Four Intel® Xeon™ processor MP 2.0GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 4) Four Intel® Xeon™ processor MP 2.8GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled


54

Intel® Xeon™ MP vs. Xeon™ Relative OLTP Performances

Which processor is better?

(2P) Intel® Xeon™ processor @ 2.8GHz, 533MHz Bus, 0 L3:        1.00
(4P) Intel® Xeon™ processor MP 2.0GHz, 400MHz Bus, 2MB L3:     2.00

Xeon processor MP Targeted for OLTP

Source: TPC.org


55

MMX Integer Operations

Add (saturation): paddusw mm0, mm3
packed add with unsigned saturation on words

  mm0:  F000h   a2      a1      a0
        +       +       +       +
  mm3:  3000h   b2      b1      b0
  mm0:  FFFFh   a2+b2   a1+b1   a0+b0   (result)

Add (wrap around): paddw mm0, mm3
packed add on words

  mm0:  F000h   a2      a1      a0
        +       +       +       +
  mm3:  3000h   b2      b1      b0
  mm0:  2000h   a2+b2   a1+b1   a0+b0   (result)
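The difference between the two adds can be reproduced in portable C, one 16-bit lane at a time. This is only a sketch of the arithmetic, not of how the hardware or any intrinsic is implemented; the helper names `add_wrap16`, `add_usat16`, and `packed_add4` are made up for this example.

```c
#include <stdint.h>

/* One 16-bit lane of a paddw-style add: the sum wraps modulo 2^16. */
static uint16_t add_wrap16(uint16_t a, uint16_t b) {
    return (uint16_t)(a + b);
}

/* One 16-bit lane of a paddusw-style add: the sum clamps at FFFFh. */
static uint16_t add_usat16(uint16_t a, uint16_t b) {
    uint32_t sum = (uint32_t)a + b;
    return sum > 0xFFFFu ? (uint16_t)0xFFFFu : (uint16_t)sum;
}

/* Apply a lane operation to all four words of a 64-bit MMX-style value.
   dst plays the role of mm0 (destination), src the role of mm3. */
static void packed_add4(uint16_t dst[4], const uint16_t src[4],
                        uint16_t (*op)(uint16_t, uint16_t)) {
    for (int i = 0; i < 4; i++)
        dst[i] = op(dst[i], src[i]);
}
```

Calling `add_usat16(0xF000, 0x3000)` and `add_wrap16(0xF000, 0x3000)` reproduces the FFFFh and 2000h lanes of the two diagrams.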


56

MMX Arithmetic Operations

Multiply-low: pmullw mm0, mm3
multiply low, words

  mm0:       a3      a2      a1      a0
             *       *       *       *
  mm3:       b3      b2      b1      b0
  products:  a3*b3   a2*b2   a1*b1   a0*b0
  mm0:       c3      c2      c1      c0      (low 16 bits of each product)

Multiply-high: pmulhw mm1, mm4
multiply high, words

  mm1:       a3      a2      a1      a0
             *       *       *       *
  mm4:       b3      b2      b1      b0
  products:  a3*b3   a2*b2   a1*b1   a0*b0
  mm1:       c3      c2      c1      c0      (high 16 bits of each product)
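Each 16 x 16 multiply yields a 32-bit product that does not fit in a word lane, which is why the low and high halves are fetched by two separate instructions. A minimal per-lane sketch in portable C follows; the function names `mul_low16` and `mul_high16` are hypothetical.

```c
#include <stdint.h>

/* Low 16 bits of the signed 16x16 -> 32-bit product (one pmullw-style lane). */
static uint16_t mul_low16(int16_t a, int16_t b) {
    int32_t product = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)product & 0xFFFFu);
}

/* High 16 bits of the same product (one pmulhw-style lane). */
static uint16_t mul_high16(int16_t a, int16_t b) {
    int32_t product = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)product >> 16);
}
```

Placing the `mul_high16` result above the `mul_low16` result gives back the full 32-bit product for a lane.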


57

MMX Arithmetic Operations

Multiply Add: pmaddwd mm1, mm4
packed multiply and add 4 words to 2 doublewords

  mm1:       a3      a2      a1      a0
             *       *       *       *
  mm4:       b3      b2      b1      b0
  products:  a3*b3   a2*b2   a1*b1   a0*b0
  mm1:       a3*b3+a2*b2     a1*b1+a0*b0     (result)

Note: This instruction does not have a saturation option.
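The multiply-add is the building block of a dot product: adjacent products are summed in pairs. The sketch below models that in portable C, with index 0 as the least-significant word; `madd_words` is an illustrative name, not an intrinsic.

```c
#include <stdint.h>

/* pmaddwd-style operation on four signed words:
   out[0] = a[0]*b[0] + a[1]*b[1], out[1] = a[2]*b[2] + a[3]*b[3]. */
static void madd_words(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    for (int i = 0; i < 2; i++) {
        int32_t lo = (int32_t)a[2 * i] * b[2 * i];
        int32_t hi = (int32_t)a[2 * i + 1] * b[2 * i + 1];
        /* Add in unsigned arithmetic: there is no saturation, so the one
           overflowing input pattern is allowed to wrap instead of clamping. */
        out[i] = (int32_t)((uint32_t)lo + (uint32_t)hi);
    }
}
```

Summing `out[0]` and `out[1]` afterwards completes a four-element dot product.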


58

MMX Convert Operations

Unpack, interleaved merge

punpcklwd mm0, mm1
unpack low words into doublewords

  mm1:  b3  b2  b1  b0
  mm0:  a3  a2  a1  a0
  mm0:  b1  a1  b0  a0   (result)

punpckhwd mm0, mm1
unpack high words into doublewords

  mm1:  b3  b2  b1  b0
  mm0:  a3  a2  a1  a0
  mm0:  b3  a3  b2  a2   (result)

Zero extend from small data elements to bigger data elements by using the unpack instruction, with zeros in one of the operands.
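The zero-extension trick works because interleaving with an all-zero operand puts a zero word above every data word. A small portable C sketch, with index 0 as the least-significant word, is shown here; `unpack_low_words` and `zero_extend_low_words` are names invented for the example.

```c
#include <stdint.h>

/* punpcklwd-style interleave of the two low words of a and b.
   Result, from low to high: a0, b0, a1, b1. */
static void unpack_low_words(const uint16_t a[4], const uint16_t b[4],
                             uint16_t out[4]) {
    out[0] = a[0];
    out[1] = b[0];
    out[2] = a[1];
    out[3] = b[1];
}

/* Zero-extend the two low words of a to doublewords by interleaving with zeros. */
static void zero_extend_low_words(const uint16_t a[4], uint32_t out[2]) {
    const uint16_t zeros[4] = {0, 0, 0, 0};
    uint16_t merged[4];
    unpack_low_words(a, zeros, merged);
    for (int i = 0; i < 2; i++)
        out[i] = (uint32_t)merged[2 * i] | ((uint32_t)merged[2 * i + 1] << 16);
}
```

The same pattern with the high-word unpack would widen the upper two words.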


59

MMX Convert Operations

Pack: packusdw mm0, mm1
pack with unsigned saturation (signed) doublewords into words

  mm1:  A    B
  mm0:  C    D
  mm0:  A'   B'   C'   D'   (result)


60

MMX Shift Operations

psllw MM0, 8
packed shift left logical words

  MM0:  703Fh  DF00h  81DBh  007Fh
  MM0:  3F00h  0000h  DB00h  7F00h   (result)

psllq MM0, 8
packed shift left logical quadword

  MM0:  703F 0000 FFD9 4364h
  MM0:  3F00 00FF D943 6400h   (result)
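The two shifts differ only in where bits are allowed to travel: within a word, or across the whole quadword. The following portable C sketch models both on a 64-bit value; `shift_left_words` and `shift_left_quad` are hypothetical helper names.

```c
#include <stdint.h>

/* psllq-style shift: the whole 64-bit value moves left as one unit. */
static uint64_t shift_left_quad(uint64_t value, unsigned count) {
    return count > 63 ? 0 : value << count;
}

/* psllw-style shift: each 16-bit word shifts on its own,
   so bits shifted out of a word are lost rather than carried over. */
static uint64_t shift_left_words(uint64_t value, unsigned count) {
    uint64_t result = 0;
    if (count > 15)
        return 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t word = (uint16_t)(value >> (16 * lane));
        word = (uint16_t)(word << count);
        result |= (uint64_t)word << (16 * lane);
    }
    return result;
}
```

Feeding the slide's operands through these helpers reproduces both result rows.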


61

MMX Compare Operations

pcmpgtw ; compare greater than, words (generate a mask)

    51         3          5          23
    >          >          >          >
    73         2          5          6
    000...00   111...11   000...00   111...11
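The mask is useful because it can drive a branch-free selection with AND/OR logic. The sketch below shows the compare and one such use, a per-lane maximum, in portable C; `cmp_gt_words` and `select_words` are illustrative names, and the select step is an added illustration rather than something shown on the slide.

```c
#include <stdint.h>

/* pcmpgtw-style compare on signed words:
   each lane becomes all ones if a > b, all zeros otherwise. */
static void cmp_gt_words(const int16_t a[4], const int16_t b[4],
                         uint16_t mask[4]) {
    for (int i = 0; i < 4; i++)
        mask[i] = (a[i] > b[i]) ? 0xFFFFu : 0x0000u;
}

/* Branch-free select: take x where the mask is set, y elsewhere. */
static void select_words(const uint16_t mask[4], const int16_t x[4],
                         const int16_t y[4], int16_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)(((uint16_t)x[i] & mask[i]) |
                           ((uint16_t)y[i] & (uint16_t)~mask[i]));
}
```

Selecting `a` under the mask and `b` elsewhere yields the larger value in every lane without a conditional jump.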


62

SSE Registers

IA-INT Registers (32 bits wide: EAX ... EDI)
– Fourteen 32-bit registers
– Direct register access
– Scalar data only

MMX™ Technology / IA-FP Registers (64 / 80 bits wide: FP0 or MM0 ... FP7 or MM7)
– Eight 64-bit registers xor eight 80-bit FP regs
– Direct access to regs
– FP data / data array
– x87 remains aliased with SIMD integer registers
– Context-switch

Streaming SIMD Extension Registers (128-bit integer) (128 bits wide: XMM0 ... XMM7)
– Eight 128-bit registers
– Single-precision / Double-precision / 128-bit integer
– Direct access to registers
– Referred to as XMM0-XMM7
– Use simultaneously with FP / MMX™ Technology
– Data array only


63

SSE Arithmetic Operations

Full Precision
– ADD, SUB, MUL, DIV, SQRT
  – Floating Point (Packed/Scalar)
  – Full 23 bit precision

Approximate Precision
– RCP - Reciprocal
– RSQRT - Reciprocal Square Root
  – Perspective correction / projection
  – Vector normalization
  – Very fast
  – Return at least 11 bits of precision


64

SSE Arithmetic Operations

MULPS: Multiply Packed Single-FP

mulps xmm1, xmm2

  xmm1:  X4      X3      X2      X1
         *       *       *       *
  xmm2:  Y4      Y3      Y2      Y1
  xmm1:  X4*Y4   X3*Y3   X2*Y2   X1*Y1   (result)


SSE Compare Operation

CMPPS: Compare Packed Single-FP

cmpps xmm0, xmm1, 1

  xmm0:  1.1      7.3      2.3      5.6
         <        <        <        <
  xmm1:  8.6      2.3      3.5      1.2
  xmm0:  111…11   000…00   111…11   000…00   (result)


66

SSE2 Registers (they look like the SSE registers because they are the same registers)

IA-INT Registers (32 bits wide: EAX ... EDI)
– Fourteen 32-bit registers
– Direct register access
– Scalar data only

MMX™ Technology / IA-FP Registers (64 / 80 bits wide: FP0 or MM0 ... FP7 or MM7)
– Eight 64-bit registers xor eight 80-bit FP regs
– Direct access to regs
– FP data / data array
– x87 remains aliased with SIMD integer registers
– Context-switch

Streaming SIMD Extension Registers (scalar / packed SIMD-SP, SIMD-DP, 128-bit integer) (128 bits wide: XMM0 ... XMM7)
– Eight 128-bit registers
– Single-precision array / Double-precision array / 128-bit integer
– Direct access to registers
– Referred to as XMM0-XMM7
– Use simultaneously with FP / MMX™ Technology
– Data array only


67

SSE2 Register Use

Backward compatible with all existing MMX™ & SSE code

[Diagram: instruction types of the Pentium® III Processor and the Willamette Processor, joined by -AND- / -OR- / -AND- connectors]

Instruction Type
– Standard x87 (SP, DP, EP)
– 64-bit SIMD int. (4x16, 8x8)
– Single-precision SIMD FP (4x32)
– Double-precision SIMD FP (2x64)
– 128-bit SIMD int. (8x16, 16x8)
– Cache Management (Memory Streaming/Prefetch)

New 64-bit double-precision floating point instructions
New / enhanced 128-bit wide SIMD integer
– Superset of MMX™ technology instruction set
No forced context switching on SSE registers (unlike MMX™/x87 registers)


68

Willamette New Instructions

New Instructions
– Extended SIMD Integer Instructions
– New SIMD Double-precision FP Instructions
– New Cacheability Instructions
Fully Integrated into Intel Architecture
– Use previously reserved opcodes
– Same addressing modes as MMX™ / SSE ops
– Several MMX™ / SSE mnemonics are repeated
– New Extended SIMD functionality is obtained by specifying 128-bit registers (xmm0-xmm7) as src/dst.


69

SIMD Double-Precision FP Ops

Same instruction categories as SIMD single-precision FP instructions
Operate on both elements of packed data, in parallel -> SIMD
Some instructions have scalar or packed versions

  Register layout:  X2 | X1 / Scalar
  Element format:   bit 63: S | bits 62-52: Exponent | bits 51-0: Significand

IEEE 754 Compliant FP Arithmetic
– Not bit exact with x87: 80 bit internal vs 64 bit mem
Usable in all modes: real, virtual x86, SMM, and protected (16-bit & 32-bit)
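The 64-bit element format above can be inspected from C by copying a `double` into an integer and masking out the three fields. This sketch assumes the platform's `double` is the IEEE 754 64-bit format, which the C standard does not guarantee; `dp_fields` and `split_double` are names invented for the example.

```c
#include <stdint.h>
#include <string.h>

struct dp_fields {
    unsigned sign;         /* bit 63 */
    unsigned exponent;     /* bits 62..52, stored with a bias of 1023 */
    uint64_t significand;  /* bits 51..0 */
};

/* Split a double into its three fields (assumes IEEE 754 binary64). */
static struct dp_fields split_double(double x) {
    uint64_t bits;
    struct dp_fields f;
    memcpy(&bits, &x, sizeof bits);  /* reinterpret the bytes safely */
    f.sign = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    f.significand = bits & 0xFFFFFFFFFFFFFULL;
    return f;
}
```

For example, 1.0 has a zero sign, a stored exponent equal to the bias, and an all-zero significand field.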


70

FP Instruction Syntax

Arithmetic FP Instructions can be:
– Packed or Scalar
– Single-Precision or Double-Precision

ASM      Intrinsics       Meaning
addps    _mm_add_ps()     Add Packed Single
addpd    _mm_add_pd()     Add Packed Double
addss    _mm_add_ss()     Add Scalar Single
addsd    _mm_add_sd()     Add Scalar Double

Suffix, first letter: Packed or Scalar
Suffix, second letter: Single or Double


71

New SSE2 Data Types

Packed & Scalar FP Instructions operate on packed single- or double-precision floating point elements
– Packed instructions operate on 4 (sp) or 2 (dp) floats
– Scalar instructions operate only on the right-most field

addps:      X4       X3       X2       X1
        op  Y4       Y3       Y2       Y1
        =   X4opY4   X3opY3   X2opY2   X1opY1

addss:      X4       X3       X2       X1
        op  Y4       Y3       Y2       Y1
        =   Y4       Y3       Y2       X1opY1

addpd:      X2       X1
        op  Y2       Y1
        =   X2opY2   X1opY1

addsd:      X2       X1
        op  Y2       Y1
        =   Y2       X1opY1
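The packed/scalar distinction can be modeled with plain arrays, treating index 0 as the right-most field. This is a portable sketch of the semantics only; `add_packed_single` and `add_scalar_single` are hypothetical names, and `dst` stands for the destination register whose upper elements a scalar operation leaves alone.

```c
/* addps-style: all four single-precision elements are added. */
static void add_packed_single(float dst[4], const float src[4]) {
    for (int i = 0; i < 4; i++)
        dst[i] += src[i];
}

/* addss-style: only the right-most element (index 0) is added;
   elements 1..3 of the destination pass through unchanged. */
static void add_scalar_single(float dst[4], const float src[4]) {
    dst[0] += src[0];
}
```

The double-precision pair (addpd, addsd) follows the same pattern with two elements instead of four.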


Extended SIMD Integer Ops

All MMX™/SSE integer instructions operate on 128-bit wide data in XMM registers
Additionally, some new functionality
– MOVDQA, MOVDQU: 128-bit aligned/unaligned moves
– PADDQ, PSUBQ: 64-bit Add/Subtract for mm & xmm regs
– PMULUDQ: Packed 32 * 32 bit Multiply
– PSLLDQ, PSRLDQ: 128-bit byte-wise Shifts
– PSHUFD: Shuffle four double-words in xmm register
– PSHUFL/HW: Shuffle four words in lower/upper half of xmm reg
– PUNPCKL/HQDQ: Interleave lower/upper quadwords
– Full 128-bit Conversions: 4 Ints vs. 4 SP Floats
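As an illustration of the shuffle family, the doubleword shuffle picks each result element with a 2-bit field of an 8-bit immediate. A portable C model is sketched below, with index 0 as the least-significant doubleword; `shuffle_dwords` is an invented name and the function expects `out` to be a different array from `src`.

```c
#include <stdint.h>

/* pshufd-style shuffle: result element i is source element
   number (imm8 >> 2*i) & 3. */
static void shuffle_dwords(const uint32_t src[4], uint8_t imm8,
                           uint32_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = src[(imm8 >> (2 * i)) & 3];
}
```

An immediate of 1Bh (binary 00 01 10 11) reverses the four elements, and 00h broadcasts element 0 to all positions.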


73

New SIMD Integer Data Formats

New 128-bit data-types for fixed-point integer data
– 16 Packed bytes
– 8 Packed words
– 4 Packed doublewords
– 2 Quadwords

[Diagram: the four formats laid out across bits 127 to 0 of a 128-bit register]


74

New DP Instruction Categories

Computation
– ADD, SUB, MUL, DIV, SQRT, MAX, MIN
  – Full 52-bit precision mantissa (Packed & Scalar)

Logic
– AND, ANDN, OR, XOR
  – Operate uniformly on entire 128-bit register
  – Must use DP instructions for double-precision data

Data Formatting
– MOVAPD, MOVUPD
  – 128-bit DP moves (aligned/unaligned)
– MOVH/LPD, MOVSD
  – 64-bit DP moves
– SHUFPD
  – Shuffle packed doubles
  – Select data using 2-bit immediate operand


75

DP Packed & Scalar Operations

The new Packed & Scalar FP Instructions operate on packed double precision floating point elements
– Packed instructions operate on 2 numbers
– Scalar instructions operate on least-significant number

addpd:      X2       X1
        op  Y2       Y1
        =   X2opY2   X1opY1

addsd:      X2       X1
        op  Y2       Y1
        =   Y2       X1opY1


76

SHUFPD Instruction

SHUFPD: Shuffle Packed Double-FP

  XMM1:  x2   x1
  XMM2:  y2   y1
  XMM1:  (y2 or y1)   (x2 or x1)   (result, selected by the 2-bit immediate)

SHUFPD XMM1, XMM2, 3   // binary 11
  XMM1:  y2   x2

SHUFPD XMM1, XMM2, 2   // binary 10
  XMM1:  y2   x1
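The two examples follow from a simple rule: bit 0 of the immediate chooses the low result element from the destination, and bit 1 chooses the high result element from the source. A portable C sketch, with index 0 as the low element (x1 / y1) and the hypothetical name `shuffle_pd`:

```c
/* shufpd-style shuffle of two doubles.
   dst plays the role of XMM1 (x2, x1), src the role of XMM2 (y2, y1). */
static void shuffle_pd(double dst[2], const double src[2], unsigned imm) {
    double low  = (imm & 1u) ? dst[1] : dst[0];  /* bit 0: x2 or x1 */
    double high = (imm & 2u) ? src[1] : src[0];  /* bit 1: y2 or y1 */
    dst[0] = low;
    dst[1] = high;
}
```

With immediate 3 the result is (y2, x2), and with immediate 2 it is (y2, x1), matching the diagrams.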


77

New DP Instruction Categories, Cont'd

Branching
– CMPPD, CMPSD
  – Compare & mask (Packed/Scalar)
– COMISD
  – Scalar compare and set status flags
– MOVMSKPD
  – Store 2-bit mask of DP sign bits in a reg32

Type Conversion
– CVT
  – Convert DP to SP & 32-bit integer w/ rounding (Packed/Scalar)
– CVTT
  – Convert DP to 32-bit integer w/ truncation (Packed/Scalar)


78

Compare & Mask Operation

CMPPD: Compare Packed Double-FP

CMPPD XMM0, XMM1, 1   // 1 = less than

  XMM0:  1.1           12.3
         <             <
  XMM1:  8.6           3.5
  XMM0:  1111111….111  0000000….000   (result)
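A compare mask is typically turned into a branch decision by collecting the top bit of each element into an integer register, which is what the sign-mask move listed under Branching does. The portable C sketch below models both steps for the less-than predicate; index 0 is the low element, and `cmp_lt_pd` and `sign_mask` are illustrative names.

```c
#include <stdint.h>

/* cmppd-style compare with predicate 1 (less than):
   each 64-bit element becomes all ones if a < b, all zeros otherwise. */
static void cmp_lt_pd(const double a[2], const double b[2], uint64_t mask[2]) {
    for (int i = 0; i < 2; i++)
        mask[i] = (a[i] < b[i]) ? ~(uint64_t)0 : 0;
}

/* movmskpd-style gather: the top bit of each element goes into a 2-bit value. */
static unsigned sign_mask(const uint64_t elems[2]) {
    return (unsigned)((elems[0] >> 63) | ((elems[1] >> 63) << 1));
}
```

Ordinary integer code can then test the 2-bit value, for example against 0 (no element passed) or 3 (both passed).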


79

Cache Enhancements

On-die trace cache for decoded uops (TC)
– Holds 12K uops
8K on-die, 1st level data cache (L1)
– 64-byte line size
– Pentium Pro was 32 bytes
– Ultrafast, multiple accesses per instruction
256K on-die, 2nd level write-back, unified data and instruction cache (L2)
– 128-byte line size
– Operates at full processor clock frequency
PREFETCH instructions return 128 bytes to L2


80

New Cacheability Instructions

MMX™/SSE cacheability instructions preserved
New Functionality:
– CLFLUSH: Cache line flush
– LFENCE / MFENCE: Load Fence / Memory Fence
– PAUSE: Pause execution
– MASKMOVDQU: Mask move 128-bit integer data
– MOVNTPD: Streaming store with 2 64-bit DP FP data
– MOVNTDQ: Streaming store with 128-bit integer data
– MOVNTI: Streaming store with 32-bit integer data


81

Streaming Stores

Willamette implementation supports:
– Writing to uncacheable buffer (e.g. AGP) with full line-writes
– Re-reading same buffer with full line-reads
– New in WNI, compared to Katmai/CuMine
Integer streaming store
– Operates on integer registers (e.g., EAX, EBX)
– Useful for OS, by avoiding need to save FP state, just move raw bits


82

Detail: Cache Line Flush

CLFLUSH: Cache line containing m8 flushed and invalidated from all caches in the coherency domain
Linear address based; allowed by user code
Potential usage:
– Allows incoherent (AGP) I/O data to be mapped as WB for high read performance and flushed when updated
– Example: video encode stream
– Precise control of dirty data eviction may increase performance by scheduling @ idle memory cycles


83

Detail: Fences

Capabilities introduced over time to enable software managed coherence:
– Write combining with the Pentium Pro processor
– SFence and memory streaming with Streaming SIMD Extensions
New Willamette Fences complete the tool set to enable full software coherence management
– LFence, strong load order
  – Blocks younger loads from passing a prior load instruction
  – All loads preceding an LFence will be completed before loads coming after the LFence
– MFence
  – Achieves effect of LFence and SFence instructions executed at same time
  – Necessary, as issuing an SFence instruction followed by an LFence instruction does not prevent a load from passing a prior store


84

Pause Instruction

PAUSE is architecturally a NOP on IA-32 processor generations
Usable since Willamette!
Not necessary to check processor type.
PAUSE is a hint to the processor that the code is a spin-wait or non-performance-critical code. A processor which uses the hint can:
– Significantly improve performance of spin-wait loops without negative performance impact, by inserting an implementation-dependent delay that helps processors with dynamic execution (a.k.a. out-of-order execution) exit from the spin-loop faster
– Significantly reduce power consumption during spin-wait loops
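A spin-wait loop that issues the hint might look like the following sketch. It assumes a C11 compiler with `<stdatomic.h>`, and it uses the GCC/Clang builtin `__builtin_ia32_pause()` on x86 targets while doing nothing elsewhere; `cpu_relax` and `spin_wait` are names chosen for this example, and the spin limit is only there to keep the sketch from hanging.

```c
#include <stdatomic.h>

/* Emit the PAUSE hint where the compiler offers it; otherwise do nothing. */
static inline void cpu_relax(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_ia32_pause();
#endif
}

/* Spin until *flag becomes nonzero or max_spins iterations have passed.
   Returns the number of iterations spent waiting. */
static unsigned long spin_wait(atomic_int *flag, unsigned long max_spins) {
    unsigned long spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0 &&
           spins < max_spins) {
        cpu_relax();  /* the hint goes inside the loop body */
        spins++;
    }
    return spins;
}
```

Another thread would release the waiter by storing a nonzero value to the flag with release ordering.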


85

NetBurst™ µArchitecture Overview

[Block diagram: System Bus; Bus Unit; 2nd Level Cache (8-way); 1st Level Cache (Data, 4-way); Front End; Fetch/Decode; Trace Cache; Microcode ROM; BTBs/Branch Prediction; Execution, Out-of-Order Core; Retirement. Legend: frequently used paths, less frequently used paths]

NetBurst™ µArchitecture

[Block diagram: 3.2 GB/s System Interface; L2 Cache and Control; BTB & I-TLB; Decoder; Trace Cache; BTB; µCode ROM; Rename/Alloc; µop Queues; Schedulers; Integer RF; FP RF; four ALUs; Load AGU; Store AGU; FMul, FAdd, MMX, SSE; FP move, FP store; L1 D-Cache and D-TLB; two paths labeled 3]


87

NetBurst™ µArchitecture Summary

Quad Pumps bus to keep the Caches loaded
Stores most recent instructions as µops in TC to enhance instruction issue
Improves Program Execution
– Issues up to 3 µops per Clock
– Dispatches up to 6 µops to Execution Units per clock
– Retires up to 3 µops per clock
Feeds back branch and data information to have required instructions and data available


88

What is Hyperthreading?

Ability of processor to run multiple threads
– Duplicate architecture state creates illusion to SW of Dual Processor (DP)
– Execution unit shared between two threads, but dedicated if one stalls
Effect of Hyperthreading on Xeon Processor:
– CPU utilization increases to 50% (from ~35%)
– About 30% performance gain for some applications with the same processor frequency
Hyperthreading Technology Results:
1. More performance with enabled applications
2. Better responsiveness with existing applications


89

Hyperthreading Implementation

Almost two Logical Processors
Architecture state (registers) and APIC* duplicated
Share execution units, caches, branch prediction, control logic and buses

[Block diagram: two copies of Architecture State and Adv. Programmable Interrupt Control; one shared Processor Execution Resource; On-Die Caches; System Bus]

*APIC: Advanced Programmable Interrupt Controller. Handles interrupts sent to a specified logical processor


90

Benefits to Xeon™ Processor

Hyperthreading Technology Performance for Dual Processor Servers

[Chart: Hyper Threading Technology Performance Gains, Intel® Xeon™ processor 2.8GHz with 512KB cache, Microsoft Windows* 2000. HT OFF: 1.00; WebBench / Web Server Performance: 1.21; Trade2 / Java Apps Server Performance: 1.19]

Enhancements in bandwidth, throughput and thread-level parallelism with Hyperthreading Technology deliver an acceleration of performance

Hyperthreading Technology increases performance by ~20% on Some Server Applications

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.

Source: Veritest (Sep, 2002). Comparisons based on Intel internal measurements w/pre-production hardware. 1) HTT on and off configurations with the Intel® Xeon™ processor 2.80 GHz with 512KB L2 cache, Intel® Server Board SE7501WV2 with Intel® E7501 chipset, 2GB DDR, Microsoft Windows* 2000 Server SP2, Intel® PRO/1000 Gigabit Server adapter, AMI 438 MegaRAID* controller v1.48 16MB EDO RAM, Dell PowerVault 210S disk array. 2) HTT on and off configurations with the Intel® Xeon™ processor 2.80 GHz with 512KB L2 cache, Intel® Server Board SE7501WV2 with Intel® E7501 chipset, 2GB DDR, Microsoft Windows* 2000 Server SP2, Intel® PRO/1000 Gigabit Server adapter, AMI 438 MegaRAID* controller v1.48 16MB EDO RAM, Dell PowerVault 210S disk array.


91

Hyperthreading for Workstation

Performance gains whether running:
– Multiple tasks within one application
– Multiple applications running at once

[Chart: Hyperthreading Technology performance gains, Intel® Xeon™ processor 2.8GHz with 512KB cache. HT Off: 1.00]

Multi-Threaded Application
– Applications: CHARMm*, 3DSM*5, D2cluster*, BLAST*, Lightwave 3D*75
– Plotted values: 1.15, 1.15, 1.26, 1.27, 1.27

Multi-Tasking Application
– Applications: Patran* + Nastran*, Multiple Compiles, 3ds max* + Photoshop, Compile + Regression, Maya* multiple renderings + Animation
– Plotted values: 1.18, 1.19, 1.26, 1.29, 1.37

Hyperthreading Technology increases performance by 15-37% on Workstation Applications

Source: Intel Corporation. With and without Hyperthreading Technology on the following system configuration: Intel Xeon Processor 2.80 GHz/533 MHz system bus with 512KB L2 cache, Intel® E7505 chipset-based Pre-Release platform, 1GB PC2100 DDR CL2 CAS2-2-2, (2) 18GB Seagate* Cheetah ST318452LW 15K Ultra160 SCSI hard drive using Adaptec 39160 SCSI adapter BIOS 3.10.0, nVidia* Quadro4 Pro 980XGL 128MB AGP 8x graphics card with driver version 40.52, Windows XP* Professional build 2600.


92

Hyperthreading Resources

Type | Description | Example
Shared | Each logical processor can use, evict or allocate any part of resource | Cache, WC Buffers, VTune reg., MS-ROM
Duplicated | Each logical processor has its own set of resources | APIC, registers, TSC, IP
Split | Resources are hard partitioned in half | Load/Store buffers, ITLB, ROB, IAQ
Tagged | Resource entries are tagged with processor ID | Trace Cache, DTLB



93

Xeon Processor Pipeline, Simplified

Buffering queues separate major pipeline logic blocks
Buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block

[Figure: pipeline logic blocks Fetch, Decode, TC/MSROM, Rename/Allocate, OOO Execute, Retirement, separated by queues; legend: buffering queues duplicated, buffering queues partitioned]



94

HT in NetBurst

Front End
  – Execution Trace Cache
  – Microcode Store ROM (MSROM)
  – ITLB and Branch Prediction
  – IA-32 Instruction Decode
  – Micro-op Queue

[Figure: NetBurst block diagram with labels System Bus, Bus unit, 3rd level cache (optional, server product), 2nd level cache, 1st level cache, 4 way, Fetch/Decode, Trace Cache / MS ROM, OOO Execution, Retirement, BTBs/Branch Prediction]



95

Front End

Responsible for delivering instructions to the later pipe stages
Trace Cache Hit
  – When the requested instruction trace is present in the trace cache
Trace Cache Miss
  – Requested instruction is brought into the trace cache from L2 cache



96

Front End: Trace Cache Hit

Two separate instruction pointers
Two logical processors arbitrate for access to TC each cycle
If one logical processor stalls, the other uses full bandwidth of TC

[Figure: two instruction pointers (IP) feed the Trace Cache, which feeds the Micro-Op Queue]
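The arbitration rule on this slide can be pictured with a toy cycle-by-cycle model. This is only a sketch: the slide says the two logical processors arbitrate each cycle, and the strict alternation used here when both are ready is an assumption, as are the function and variable names.

```python
# Toy model of trace cache (TC) arbitration between two logical processors.
# Assumption (not stated on the slide): when both are ready, access alternates.

def arbitrate(stalled_lp0, stalled_lp1):
    """Return, per cycle, which logical processor (0, 1 or None) gets the TC."""
    grants = []
    last = 1  # pretend LP1 went last, so LP0 is served first
    for s0, s1 in zip(stalled_lp0, stalled_lp1):
        ready = [lp for lp, stalled in ((0, s0), (1, s1)) if not stalled]
        if not ready:
            grant = None            # both stalled: TC idles this cycle
        elif len(ready) == 1:
            grant = ready[0]        # the other one is stalled: full bandwidth
        else:
            grant = 1 - last        # both ready: alternate
        if grant is not None:
            last = grant
        grants.append(grant)
    return grants

both_ready = arbitrate([False] * 4, [False] * 4)
lp1_stalled = arbitrate([False] * 4, [True] * 4)
```

With both logical processors ready the grants alternate; with one stalled, every cycle goes to the other.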



97

Programming Models

Two major types of parallel programming models
  – Domain decomposition
  – Functional decomposition
Domain Decomposition
  – Multiple threads working on subsets of the data
Functional Decomposition
  – Different computation on the same data
  – E.g. motion estimation vs. color conversion, etc.
Both models can be implemented on HT processors
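The two models can be contrasted in a few lines. This is a structural sketch only, using Python's standard `concurrent.futures`; the data, the helper names and the choice of `sum` and `max` as the two "different computations" are assumptions made for illustration. In CPython the global interpreter lock usually prevents these threads from running Python code truly in parallel, so the example shows how the work is divided, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 9))  # illustrative data set: 1..8

def domain_decomposition(values, workers=2):
    """Same computation (a sum) on different subsets of the data."""
    size = (len(values) + workers - 1) // workers
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))  # combine the partial sums

def functional_decomposition(values):
    """Different computations (sum and max) on the same data."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        total = pool.submit(sum, values)
        largest = pool.submit(max, values)
        return total.result(), largest.result()

domain_result = domain_decomposition(data)
functional_result = functional_decomposition(data)
```

In the first function each thread owns a slice of the data; in the second each thread owns a task and both read the whole data set.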



98

Threading Implementation

O/S thread implementations may differ
Microsoft Win32
  – NT threads (supports 1-1 O/S level threading)
  – Fibers (supports M-N user level threading)
Linux
  – Native Linux Thread (severely broken & inefficient)
  – IBM Next Generation Posix Threads (NGPT): IBM’s attempt to fix Linux native thread
  – Red Hat Native Posix Thread Library (NPTL): supports 1-1 O/S level threading that is to be Posix compliant
Others
  – Pthreads (generic Posix compliant thread)
  – Sun Solaris Light Weight Processes (lwp), Sun Solaris user level threads
Thread Model Issues Somewhat Orthogonal to HT



99

OS Implications of HT

ALL UP OS
Legacy MP OS
  – Backward compatible, will not take advantage of Hyperthreading Technology
Enabled MP OS
  – OS with basic Hyperthreading Technology functionality
Optimized MP OS
  – OS with optimized Hyperthreading Technology support
Fully compatible with ALL existing O/S… but only optimized O/S enables the most benefits



100

HT Optimized OS

Windows XP
  – Windows XP
  – Windows XP Professional
Windows 2003
  – Enterprise
  – Data Center
Enabled
  – RedHat Enterprise Server (version 7.3, 8.0)
  – RedHat Advanced Server 2.1
  – Suse (8.0, 9.0)



101

OS Scheduling

HT enabled O/S sees two processors for each HT physical processor
  – Enumerates first logical processor from all physical processors first
Schedules processors almost same as regular SMP
  – Thread priority determines schedule, but CPU dispatch matters
  – O/S independently submits code stream for thread to logical processors and can independently interrupt or halt each logical processor (no change)

[Figure: Physical Processor 1 and Physical Processor 0, each containing Logical Processor 1 and Logical Processor 0; each logical processor has an 8-bit ID labeled CPUID. ID values as extracted: 00000011, 00000001, 00000000, 00000010]
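The enumeration rule can be made concrete with the four IDs from the figure. The sketch below assumes the lowest bit of each ID selects the logical processor within a package and the remaining bits select the physical processor; the slide does not spell out this bit layout, and the function names are made up for illustration.

```python
# Sketch: decode the figure's 8-bit IDs and derive the enumeration order
# "first logical processor from all physical processors first".
# Assumed layout (not stated on the slide): bit 0 = logical processor,
# higher bits = physical processor.

def decode_id(cpu_id, logical_bits=1):
    """Split an ID into (physical processor, logical processor)."""
    logical = cpu_id & ((1 << logical_bits) - 1)
    physical = cpu_id >> logical_bits
    return physical, logical

def enumeration_order(ids):
    """All logical processor 0s first (by physical processor), then all 1s."""
    def key(cpu_id):
        physical, logical = decode_id(cpu_id)
        return (logical, physical)
    return sorted(ids, key=key)

ids = [0b00000011, 0b00000001, 0b00000000, 0b00000010]
order = enumeration_order(ids)
for cpu_id in order:
    physical, logical = decode_id(cpu_id)
    print(f"{cpu_id:08b}: physical {physical}, logical {logical}")
```

Under that assumption the O/S would list IDs 00000000 and 00000010 (logical processor 0 of each package) before 00000001 and 00000011, so that two runnable threads land on different physical processors first.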



102

Thread Management

Avoid coding practices that disable hyperthreaded processors, e.g.
  – Avoid 64KB Aliasing
  – Avoid processor serializing events (e.g. FP denormals, self modifying code, etc.)
Avoid Spin Locks
  – Minimize lock contention to less than two threads per lock
  – Use “Pause” and “O/S synchronization” when Spin-Wait loops must be implemented
In addition, follow multi-threading best practices:
  – Use O/S services to block waiting threads
  – Spin as briefly as possible before yielding to O/S
  – Avoid false sharing
  – Avoid unintended synchronizations (C Runtime, C++ Template Library implementations)
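The advice to spin briefly and then block in the O/S can be sketched as follows. This is an illustration of the control flow only: Python has no equivalent of the x86 `pause` instruction, so the spin body here is a plain flag check, and the function name `wait_for` and the default `spin_limit` are assumptions rather than recommended values.

```python
import threading

def wait_for(event, spin_limit=1000):
    """Spin a bounded number of times, then block using an O/S service."""
    for _ in range(spin_limit):
        if event.is_set():
            return "spun"      # flag arrived while spinning
    event.wait()               # give up the CPU until the flag is set
    return "blocked"

# Usage: one thread waits, the main thread signals.
ready = threading.Event()
results = []
waiter = threading.Thread(target=lambda: results.append(wait_for(ready, spin_limit=10)))
waiter.start()
ready.set()
waiter.join()
```

Which of the two paths the waiter takes depends on timing. On a hyperthreaded processor the point of the bounded spin is that a long spin loop on one logical processor consumes execution resources the other logical processor could use.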



103

Threading Tools

Intel ThreadChecker Tool
  – Itemization of parallelization bugs and source
  – ThreadChecker class
OpenMP
  – Thread model in which programmer introduces parallelism or threading via directives or pragmas
Intel VTune Analyzer
  – Provides analysis and drills down to source code
  – ThreadChecker Integration
GuideView
  – Parallel performance tuning



104

Software Tools

Intel C/C++ Compiler
  – Support for SSE and SSE2 using C++ classes, intrinsics, and assembly
  – Improved vectorization and prefetch insertion
  – Profile-guided optimizations
  – G7 compiler switch for Pentium® 4 optimizations
Register Viewing Tool (RVT)
  – Shows contents of XMM registers as they are updated
  – Plugs into Microsoft* Visual Studio*
Microsoft* Visual Studio* 6.0 Processor Pack*
  – Support for SSE and SSE2 instructions, including intrinsics
  – Available for free download from Microsoft*
Microsoft* Visual Studio* .NET
  – Provides improved support for Intel® NetBurst™ micro-architecture
  – Recognizes XMM registers



105

Hyperthreading is NOT:

Hyperthreading is not a full, dual-core processor
Hyperthreading does not deliver multi-processor scaling

[Figure: side-by-side block diagrams of Dual Processor, Dual Core and Hyper-Threading configurations, built from Processor core, APIC, Arch. State and (On-Die) Cache blocks]



106

Backup



107

TERMS

Branch: Transfer of control to an address different from the next instruction. Unconditional or conditional.
Branch Prediction: Ability to guess the target of a conditional branch. Can be wrong, in which case we have a mispredict.
CISC: Complex instruction set computer.
Compiler: Tool translating high-level instructions into low-level machine instructions. Output can be asm source (ASCII) or binary machine code.
EPIC (Explicitly Parallel Instruction Computing): New architecture jointly defined by Intel® and HP. Is the foundation of a new 64-bit Instruction Set Architecture.



108

TERMS

Explicit parallelism: Intended ability of two tasks to be executed by design (explicitly) at the same time. Task can be as simple as an instruction, or as complex as a complete program.
Implicit parallelism: Incidental ability of two or more tasks to be executed at the same time. Example: sequence of integer add and FP convert instructions without common registers or memory addresses, executed on a target machine that happens to have respective HW modules available.
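The "without common registers or memory addresses" condition in the implicit parallelism example can be written as a small check. This is a simplified sketch: each instruction is reduced to a hypothetical pair of read and write sets, the operand sets shown are illustrative rather than a full model of the instructions, and whether the hardware actually has free functional units is not modeled.

```python
# Sketch: two instructions can overlap if neither writes anything the other
# reads or writes. Instructions are modeled as (reads, writes) sets of names.

def independent(instr_a, instr_b):
    """True if the two instructions share no conflicting register or address."""
    reads_a, writes_a = instr_a
    reads_b, writes_b = instr_b
    conflict = (writes_a & (reads_b | writes_b)) | (writes_b & reads_a)
    return not conflict

# Illustrative operand sets (assumed, simplified):
int_add = ({"eax", "ebx"}, {"ebx", "eflags"})      # add %eax, %ebx
fp_convert = ({"xmm0", "xmm1"}, {"xmm0"})          # an FP convert on xmm registers
dependent_add = ({"ebx", "ecx"}, {"ecx", "eflags"})  # add %ebx, %ecx

can_overlap = independent(int_add, fp_convert)
must_serialize = not independent(int_add, dependent_add)
```

The integer add and the FP convert touch disjoint state, so they are candidates for implicit parallelism; the second add reads `%ebx`, which the first one writes, so it is not.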



109

TERMS

Instruction Set Architecture (ISA): Architecturally visible instructions that perform software functions and direct operations within the processor. HP and Intel® jointly developed a new 64-bit ISA. This ISA integrates technical concepts from the EPIC technology.
Memory latency: Time to move data from memory to the processor, at request of the processor.
Mispredict: A wrong guess as to where the new flow of control will continue as a result of a branch (or similar control flow instruction).
