
Intel Pentium® 4 processor 1

MAMAS – Computer Architecture

Pentium® 4 Processor

Based on

The Microarchitecture of the Pentium® 4 Processor – ITJ Q1/2001
Hyper-Threading Technology Architecture and Microarchitecture – ITJ 02/2002

The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology – ITJ 02/2004

Dr. Lihu Rappoport

Intel Pentium® 4 processor 2

NetBurst® Microarchitecture
Trace Cache
Out-of-order core
Rapid Execution
Store-to-load forwarding enhancements

Intel Pentium® 4 processor 3

Pentium® 4

                 Willamette        Northwood          Prescott
Process          180 nm            130 nm             90 nm
Transistors      42M               55M                125M
Die size         217 mm2           145 mm2            112 mm2
L1 cache         8KB 4-way         8KB 4-way          16KB 8-way
L2 cache         256KB             512KB              1MB
Freq             ≤ 2.0 GHz         ≤ 3.4 GHz          2.8 – 3.8 GHz
Peak power       ~70 Watt          ~50 Watt           ~100 Watt
Bus              400 MHz           400/533/800 MHz    533/800 MHz
Architecture     MMX, SSE, SSE2    MMX, SSE, SSE2     MMX, SSE, SSE2, SSE3
HT               No                Yes                Yes

Intel Pentium® 4 processor 4

Pentium® 4 Block Diagram

[Block diagram: the Instruction TLB & prefetcher and the decoder (backed by the uCode ROM) fill the Trace Cache (12K uops); the FE BTB (4K entries) and TC BTB (2K entries) steer fetch; uops flow through the Uop Queue to Rename/Alloc, then into the memory and Int/FP uop queues and the schedulers (Mem, Fast, Slow/General FP, Simple FP); the integer and FP register files / bypass networks feed two double-speed Fast ALUs (simple instructions), a Slow ALU (complex instructions), Load and Store AGUs, and the FP units (Fmul/Fadd/MMX/SSE, FP move/store); the L1 D-Cache (16KB, 8-way) and D-TLB connect to the 1MB 8-way L2 cache over a 256-bit path at 108 GB/sec; a quad-pumped Bus Interface Unit drives the 64-bit system bus at 6.4 GB/sec]

Intel Pentium® 4 processor 5

Block Diagram Explained

BTB and iTLB
– Determine the next instructions to be fetched from the L2 cache in case of a TC miss

The instruction decoder
– Takes bytes delivered from the L2 cache and decodes them into uops

The Trace Cache
– Caches uops from the instruction decoder
– Used as the L1 instruction cache
– Delivers 3 uops/clock
– The micro-code ROM holds the complex micro-code sequences

Uop Queue
– Holds uops from the TC, ucode ROM, or decode logic
– Decouples the FE from the OOO Execution Engine

The Renamer
– Maps architectural registers onto a 128-deep physical register file

The Allocator
– Assigns all the hardware buffers the uop needs in order to execute

The Schedulers
– Determine when a uop is ready to execute

Execution Units
– Up to 4 integer arithmetic operations per clock cycle
– 1 floating-point operation (including 128-bit SSE) per clock cycle
– A memory load and a store operation (up to 128 bit) each clock cycle

3.2 GB/sec system bus

Intel Pentium® 4 processor 6

Basic P6 Pipeline (intro at 733MHz, 0.18µ):
1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec

Basic Pentium® 4 Processor Pipeline (intro at 1.4GHz, 0.18µ):
1 TC Nxt IP | 2 TC Nxt IP | 3 TC Fetch | 4 TC Fetch | 5 Drive | 6 Alloc | 7 Rename | 8 Rename | 9 Que | 10 Sch | 11 Sch | 12 Sch | 13 Disp | 14 Disp | 15 RF | 16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive

Intel Pentium® 4 processor 7

Trace Cache

Decoding several IA-32 instructions per clock at high frequency is difficult
– Instructions have a variable length and many different options
– Requires a significant amount of logic and intermediate state
– Takes several pipeline stages, which adds to the branch mis-prediction penalty
– The decode logic takes instruction bytes from the streaming buffers and decodes them into uops

The Trace Cache caches the uops of previously decoded instructions
– Bypasses the instruction decoder most of the time, and provides most of the uops
– Decoding is only needed for instructions that miss the TC
– Allows the decoder to be simplified: decodes 1 instruction per cycle
– Reduces the branch misprediction penalty

The Trace Cache is the primary (L1) instruction cache
– Holds 12K uops
– 8-way set associative with LRU replacement
– Hit rate similar to that of an 8KB–16KB conventional instruction cache
– Delivers 3 uops/clock

Most uops are fetched and executed from the TC
– Only when the TC misses are instructions fetched and decoded from the L2 cache

Intel Pentium® 4 processor 8

Trace Cache (cont.)

Conventional instruction caches cannot provide instructions before and after a taken branch in the same cycle
– Fetch bandwidth is limited to basic blocks

The TC builds traces: program-ordered sequences of uops
– Allows the target of a branch to be included in the same TC line as the branch itself

[Figure: a trace line containing several jmp instructions, with control jumping into the line at its start and out of the line at its end]
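The trace-building idea can be sketched in a few lines of C (a toy model only: LINE_UOPS comes from the next slide, fetch_uop is a caller-supplied stand-in for fetch-and-decode, and sequential IPs are simplified to +1):

```c
#define LINE_UOPS 6                       /* six uops per trace line (next slide) */

typedef struct {
    unsigned ip;                          /* instruction this uop came from */
    int      is_taken_branch;             /* predicted-taken branch? */
    unsigned target;                      /* branch target if taken */
} uop_t;

/* Build one trace line starting at `ip`, following predicted-taken branches.
 * A conventional I-cache line would end at the taken branch (end of the basic
 * block); a trace line simply continues with the uops at the branch target. */
static int build_trace_line(unsigned ip, uop_t (*fetch_uop)(unsigned), uop_t line[LINE_UOPS])
{
    int n = 0;
    while (n < LINE_UOPS) {
        uop_t u = fetch_uop(ip);
        line[n++] = u;
        ip = u.is_taken_branch ? u.target : ip + 1;   /* +1: simplified sequential IP */
    }
    return n;
}
```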

Intel Pentium® 4 processor 9

Trace Cache (cont.)

Traces have variable length
– Broken into trace lines, six uops per trace line
– There can be many trace lines in a single trace

The TC has its own branch predictor (Trace BTB)
– Predicts branches that hit in the TC
– Directs where instruction fetching needs to go next in the TC

Prescott encodes more types of uops inside the trace cache
– If an instruction uses a uop that cannot be encoded in the trace cache, all uops for that instruction have to be sequenced from the Microcode ROM; the required transitions to the Microcode ROM reduce average FE bandwidth
– New instructions that can now be encoded in the trace cache include indirect calls with a register source operand and software prefetch instructions

Intel Pentium® 4 processor 10

Microcode ROM

Used for complex x86 instructions
– Such as string move, and for fault and interrupt handling

When a complex instruction is encountered, the TC jumps into the microcode ROM

After the microcode ROM finishes sequencing uops for the current x86 instruction
– The front end resumes fetching uops from the TC

Uops from the TC and the microcode ROM are buffered in a simple, in-order uop queue
– Helps smooth the flow of uops going to the out-of-order execution engine

Intel Pentium® 4 processor 11

iTLB and Front-End BTB

If there is a TC miss
– The FE BTB predicts the next instruction IP
– The iTLB translates the next IP's linear address to a physical address, and also performs page-level protection checking
– Instruction bytes are fetched from the L2 cache and placed into streaming buffers, until they can be decoded
– Decoded into uops to be placed in the TC

Hardware instruction pre-fetching logic
– Fetches x86 instruction bytes from the L2 cache that are predicted to be executed next
– Guided by the front-end BTB to know what to fetch next
– Attempts to keep the instruction decoder fed with the next x86 instructions the program needs to execute

Intel Pentium® 4 processor 12

Static Branch Prediction at Decode

If the decoder realizes that an instruction is a branch that was missed by the BTB, a static branch prediction is made
– Allows for a faster restart than waiting for execution-time detection

The simplest static prediction algorithm
– Predict a backward branch to be taken
– Predict a forward branch to be not taken
– Correctly predicts taken for the first iteration of most loops

Not all backward branches are loop-ending branches
– If the distance between a backward branch and its target is larger than some threshold, the branch is unlikely to be a loop-ending branch
– A backward branch is statically predicted taken only if the branch distance is less than this threshold

Branches with certain conditions are more often not taken, regardless of their direction and distance
– These conditions are not common loop-ending conditions
– The static prediction for branches with these conditions is not taken
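A minimal C sketch of this static rule follows; the distance threshold and the condition classification are placeholders, since the slides do not give the actual hardware values:

```c
#include <stdbool.h>
#include <stdint.h>

#define LOOP_DISTANCE_THRESHOLD 4096        /* placeholder, not the real constant */

typedef enum { COND_OTHER, COND_RARELY_TAKEN } cond_class_t;

/* Static prediction for a conditional branch that missed the BTB:
 * - "rarely taken" condition codes         -> not taken,
 * - short backward branch (likely a loop)  -> taken,
 * - forward branch, or far backward branch -> not taken. */
static bool static_predict_taken(uint32_t branch_ip, uint32_t target_ip, cond_class_t cc)
{
    if (cc == COND_RARELY_TAKEN)
        return false;                       /* condition-based override */

    bool backward = target_ip < branch_ip;
    uint32_t distance = backward ? branch_ip - target_ip : target_ip - branch_ip;
    return backward && distance < LOOP_DISTANCE_THRESHOLD;
}
```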

Intel Pentium® 4 processor 13

OOO Core: RAT and Alloc

The RAT renames architectural registers onto 128 physical regs
– Tracks the latest version of each architectural register
– Tells the next instruction where to get its input operands

The Allocator allocates uops from the uop queue into
– 126 ROB entries, 128 integer and 128 floating-point physical registers
– 48 load buffer entries
– 32 store buffer entries (up from 24 in previous P4’s)

After allocation and renaming, uops are placed in one of 2 queues
– Memory instruction queue: for memory operations (loads and stores)
– General instruction queue: for all other operations

The above two queues send uops to five scheduler queues
– Each scheduler has its own 8–12 entry queue from which it selects uops to be executed

[Diagram: Uop Queue → RAT/Alloc → General instruction queue and Memory instruction queue → Scheduler queues 1–5]
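A toy C sketch of the renaming step (the 128-entry physical register file is from the slide; the 8 IA-32 integer registers, the free-list handling, and the function shape are simplifications, not the real allocator logic):

```c
#include <stdint.h>

#define ARCH_REGS 8            /* IA-32 integer registers (illustration) */
#define PHYS_REGS 128          /* 128 integer physical registers (from the slide) */

typedef struct {
    uint8_t rat[ARCH_REGS];    /* latest physical register holding each arch register */
    uint8_t free_list[PHYS_REGS];
    int     free_count;
} renamer_t;

/* Rename "dst = src1 op src2": read the sources through the RAT, then point
 * the destination at a freshly allocated physical register.  Returns the new
 * physical destination, or -1 if none is free (the allocator would stall). */
static int rename_uop(renamer_t *r, int dst, int src1, int src2,
                      int *psrc1, int *psrc2)
{
    *psrc1 = r->rat[src1];     /* tell the uop where its inputs live */
    *psrc2 = r->rat[src2];

    if (r->free_count == 0)
        return -1;             /* out of physical registers: stall allocation */

    int pdst = r->free_list[--r->free_count];
    r->rat[dst] = pdst;        /* later uops now read dst from pdst */
    return pdst;
}
```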

Intel Pentium® 4 processor 14

Register renaming: P3 vs. P4
– Pentium 3: on commit, registers are copied to the RRF
– Pentium 4: on commit, the Retirement RAT is updated

Intel Pentium® 4 processor 15

OOO Core – Schedulers

Five uop schedulers schedule different uop types to execution units
– Collectively, the schedulers can dispatch up to 6 uops/cycle
– Twice the rate at which the OOO core can receive uops from the Trace Cache
– Allows higher flexibility to issue ready uops on the different ports

Schedulers determine when uops are ready to execute based on
– The readiness of their dependent input register operands, and
– The availability of the execution unit resources

The 5 schedulers are connected to 4 dispatch ports
– When multiple schedulers have uops ready to execute to the same port, the schedulers arbitrate for the dispatch port

Intel Pentium® 4 processor 16

OOO Core – Dispatch Ports

Port 0
– 1st half of cycle: can dispatch a uop either to the FP Move unit or to the fast ALU
– 2nd half of cycle: can dispatch one uop to the fast ALU

Port 1
– 1st half of cycle: can dispatch a uop either to the complex FP/integer units or to the fast ALU
– 2nd half of cycle: can dispatch one uop to the fast ALU

Port 2: can dispatch one load per cycle
Port 3: can dispatch one store data per cycle

[Port/unit map:
Exec Port 0 – double-speed ALU: add/sub, logic, store data, branch; FP/Media Move: FP/SSE register move, store data, FXch
Exec Port 1 – double-speed ALU: add/sub, some shift/rotate; Complex Int: shift, rotate, integer multiply; Complex FP/Media: FP/SSE add, FP/SSE mul, FP/SSE div, MMX
Load Port – all loads, LEA, SW prefetch
Store Port – store address]

Intel Pentium® 4 processor 17

OOO Core – Retirement

After execution, uops are placed in the ROB, waiting to be retired
– Retirement logic commits the architectural state in program order

Once stores have retired, their store data needs to be written into the L1 D-cache

Intel Pentium® 4 processor 18

Memory System

Store instructions are broken up into two uops
– Store address: calculates the address of the store
– Store data: stores the data into the Store Data buffer
– The actual write to memory is done when the store retires

Separating store-data and store-address is important for memory disambiguation
– Allows the store-address to dispatch earlier, even before the stored data is known
– Resolving address conflicts earlier opens the memory pipeline for other loads
– Store-data and store-address can be issued to the execution units in parallel: store-address is dispatched to the AGU when its sources (base and index registers) are ready, and store-data is dispatched to the store data buffer unit independently, when its source operand is available

Loads are scheduled asynchronously to store-data uops
– A load can execute before a store-data uop, even if it needs to get data from that store
– The load would then have to be re-executed after the store-data uop had finally executed

This causes two problems
– It can add latency, because a re-executed load receives its data much later than the common-case store-to-load forwarding latency
– Re-executing a load wastes precious load bandwidth

Added a simple predictor
– Marks whether a load uop is likely to receive forwarded data, and, if so, from which store
– A load that is predicted to forward is held in the load scheduler until the store-data uop that produces its data is scheduled
– Both of the performance penalties mentioned above are significantly reduced
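A small C sketch of such a predictor, indexed by load IP; the table size, the hash, and the update policy are illustrative assumptions only:

```c
#include <stdbool.h>
#include <stdint.h>

#define PRED_ENTRIES 256                        /* assumed table size */

typedef struct {
    bool    will_forward;                       /* did this load forward last time? */
    uint8_t store_distance;                     /* which older store supplied the data */
} fwd_pred_entry_t;

static fwd_pred_entry_t fwd_pred[PRED_ENTRIES];

/* Scheduler side: a load predicted to forward is held until the store-data
 * uop it depends on has been scheduled, avoiding a wasted issue and replay. */
static bool should_hold_load(uint32_t load_ip, bool producing_store_data_scheduled)
{
    fwd_pred_entry_t *e = &fwd_pred[load_ip % PRED_ENTRIES];   /* simple IP hash */
    return e->will_forward && !producing_store_data_scheduled;
}

/* Execution side: train the predictor with what actually happened. */
static void train_fwd_pred(uint32_t load_ip, bool forwarded, uint8_t store_distance)
{
    fwd_pred_entry_t *e = &fwd_pred[load_ip % PRED_ENTRIES];
    e->will_forward   = forwarded;
    e->store_distance = forwarded ? store_distance : 0;
}
```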

Intel Pentium® 4 processor 19

SW prefetch instructions

Get data into the L2 cache before the data is actually used
– Opportunistically look up the L2 cache, and on a miss initiate a data prefetch cycle on the front-side bus
– Data is filled only into the L2 cache, so as not to pollute the small L1 D-cache

If the prefetch access misses the DTLB
– In previous P4’s: the prefetch operation is dropped
– In Prescott: initiate a page table walk and fill the DTLB
– If the prefetch has a page fault, it is silently dropped: the fault is not reported to the OS, and the prefetch is not performed
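From C, software prefetch is reached through the `_mm_prefetch` intrinsic in `<xmmintrin.h>`; the loop below is only an illustration, and the prefetch distance is an untuned guess:

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T1 */
#include <stddef.h>

/* Sum an array while prefetching ahead.  On the Pentium 4 the prefetched line
 * is filled into the L2 cache only (not the small L1 D-cache), and, as noted
 * above, a prefetch that would page-fault is dropped silently. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 64;                       /* elements ahead (illustrative) */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            _mm_prefetch((const char *)&a[i + dist], _MM_HINT_T1);  /* into L2 */
        s += a[i];
    }
    return s;
}
```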

Intel Pentium® 4 processor 20

Hardware Prefetch

Reduces the time spent waiting for DRAM
– Can prefetch both code and data streams
– Data streams can be accessed by loads and/or stores

The hardware prefetcher looks for streams of data
– Tries to predict what data will be needed next by the processor and proactively fetches it

Can be superior to software prefetching
– Requires no effort from the programmer
– Can improve performance on code that has no software prefetch instructions

Prescott improved the prefetcher

Intel Pentium® 4 processor 21

Low Latency L1 Data Cache

16K-byte, 8-way set-associative, 64 bytes per cache line; virtually addressed and physically tagged

Write-through
– Writes to the L1 D-cache are always copied into the L2 cache

One load and one store per clock cycle

Load latency is a key aspect of processor performance
– Especially for IA-32 code, which has many loads and stores due to the small number of registers

The L1 data cache has a 2-clock load-use latency for integer loads
– The distance in clocks from the load scheduler to execution is longer than the load execution latency itself
– Minimizes the load-to-use latency by using a partial virtual address match to detect early in the pipeline whether a load is likely to hit or miss in the cache

The scheduler assumes that the load will hit the L1 data cache
– It dispatches dependent operations before the parent load has finished executing; this is a form of data speculation
– If the load misses the L1 data cache, dependent operations that have started execution use incorrect data; replay logic tracks and re-executes the instructions that used incorrect data
– Only dependent operations are replayed: independent ops are allowed to complete

≤4 outstanding L1 data cache load misses can be pending at any one time in the memory subsystem

Intel Pentium® 4 processor 22

Store-to-Load Forwarding

Stores write to memory (the L1 D-cache) only when they retire
– Due to the deep pipeline, a long time can pass from when a store has valid data to when the store retires and its data is written into the cache
– Stores that are ready to retire often have to wait for previous stores to retire
– If a load dependent on a store had to wait for that store to commit before it could get its data, there would be a significant performance reduction

Each store is allocated an entry in a Store Forwarding Buffer (SFB)
– The SFB has 32 entries ⇒ up to 32 stores can be in the pipeline (in various stages)
– Once the store data is known, the store writes the data into its SFB entry

For a load
– The load address is matched against all older stores in the SFB, in parallel with the load’s L1 D-cache access
– If it matches, the load gets its data directly from the SFB, and does not need to wait for the store to retire

Forwarding is allowed if
– The load hits the same address as a preceding completed store in the SFB
– The load has the same size as, or is smaller than, the pending store
– The load has the same beginning physical address as the store
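These conditions translate directly into a small C predicate (a sketch: the field names and the completed/older bookkeeping are assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;        /* beginning physical address */
    uint32_t size;        /* access size in bytes */
    bool     data_ready;  /* the store-data uop has written the SFB entry */
    uint32_t age;         /* smaller = older in program order */
} mem_access_t;

/* Can load `ld` take its data straight from SFB entry `st`?
 * The store must be older and complete, start at the same address,
 * and be at least as wide as the load. */
static bool can_forward(const mem_access_t *st, const mem_access_t *ld)
{
    return st->age < ld->age
        && st->data_ready
        && st->addr == ld->addr
        && ld->size <= st->size;
}
```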

Intel Pentium® 4 processor 23

Store-to-Load Forwarding (cont.)

Otherwise, the load gets its data from the cache
– It cannot complete until the store has committed its state to the cache

The forwarding mechanism has the same latency as a cache lookup
– The SFB doesn’t have time to do a full address match and access-size check
– The load address is only partially matched against all store addresses, which can cause wrong forwarding
– Later in the pipeline, the MOB performs a full address match and verifies that the load got its data from the most recent dependent store
– If forwarding was incorrect, the load re-executes after the dependent store writes to the L1 cache, and gets its data from the cache

Intel Pentium® 4 processor 24

Bus

Memory requests not satisfied by the cache hierarchy are serviced by the bus logic

The bus logic includes
– The interrupt controller (APIC)
– The interface to off-chip system memory and I/O space

Bus logic also deals with
– Snooping: cache coherency for requests originated by other external bus agents
– Incoming interrupt request delivery via the local APICs

Intel Pentium® 4 processor 25

Hyper Threading Technology

Based on
Hyper-Threading Technology Architecture and Microarchitecture – Intel Technology Journal

Intel Pentium® 4 processor 26

Thread-Level Parallelism

Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors

Software trends
– Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
– On-line transaction processing
– Web services

Threads can be
– From the same application
– From different applications running simultaneously
– From operating system services
– From operating system threads doing background maintenance

Thread-level parallelism (TLP)
– Gives a better performance vs. transistor-count and power ratio
– Increasing single-thread performance becomes harder and harder, and is less and less power efficient

Intel Pentium® 4 processor 27

Chip Multi-Processing (CMP)

In CMP two processors are put on a single die
– Each processor has a full set of execution and architectural resources
– The processors may or may not share a large on-chip cache

CMP is orthogonal to multi-processor systems
– Can have multiple CMP processors in a multiprocessor configuration

A CMP chip is significantly larger than a single-core chip
– And therefore more expensive to manufacture

However, process technology enables this
– CMP chips are available today!

Intel Pentium® 4 processor 28

Multi-Threading

Multi-threading: a single processor executes multiple threads

Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
– Might result in wasted execution slots

Switch-on-event multithreading
– Switch threads on long-latency events such as cache misses
– Works well for server applications that have many cache misses and where the two threads are executing similar tasks

Both time-slice and switch-on-event do not optimally overlap the usage of inefficiently used resources
– E.g., branch mis-predictions, instruction dependencies

Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously w/o switching
– Makes the most effective use of processor resources
– Maximizes performance vs. transistor count and power

Intel Pentium® 4 processor 29

Hyper-Threading (HT) Technology

HT is SMT
– Makes a single processor appear as multiple logical processors
– Each logical processor keeps its own copy of the architecture state, which takes a small fraction of the die
– Each logical processor has its own interrupt controller: interrupts sent to a specific logical processor are handled only by it

The architecture state consists of registers, including
– General-purpose registers
– Control registers
– Some machine state registers

From a software or architecture perspective
– The OS views the logical processors as physical processors, and schedules threads to logical processors as in a multiprocessor system

From a micro-architecture perspective
– Logical processors share a single set of physical resources: caches, execution units, branch predictors, control logic, and buses

Intel Pentium® 4 processor 30

Two Important Goals

When one logical processor is stalled, the other logical processor can continue to make progress
– A logical processor may be temporarily stalled for a variety of reasons: cache misses, branch mispredictions, waiting for results from previous instructions
– Buffering queues separate major pipeline logic blocks
– Independent progress is ensured by either partitioning the buffering queues and limiting the number of entries each thread can use, or duplicating the buffering queues

A single active thread running on a processor with HT runs at the same speed as without HT
– Partitioned resources are recombined when only one thread is active

Intel Pentium® 4 processor 31

A high-level view of the pipeline

[Figure: pipeline stages Fetch, Decode, TC / MS-ROM, Rename/Allocate, OOO Scheduler/Execute, and Retirement, separated by buffering queues; the architecture state and APIC are duplicated per logical processor, the physical registers are shared, and each resource is marked as Shared, Duplicated, or Partitioned]

Intel Pentium® 4 processor 32

Front End

Each logical processor manages its own next-instruction-pointer

[Figure: front-end paths for a TC hit and a TC miss]

Intel Pentium® 4 processor 33

TC and Ucode-ROM

Logical processors arbitrate TC access every cycle
– If both want to access the TC, access is granted in alternating cycles
– If one logical processor is stalled or unable to use the TC, the other logical processor can use the full TC bandwidth

TC entries are tagged with a thread-ID
– TC entries are dynamically allocated as needed
– Allows one logical processor to have more entries than the other

Both logical processors share the ucode ROM
– Each manages its own ucode-ROM pointer
– Ucode-ROM access alternates between logical processors just as with the TC
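The alternating-cycle policy fits in one C function (the interface is invented for illustration; per the slide, the same policy applies to the ucode ROM):

```c
#include <stdbool.h>
#include <stdint.h>

/* Which logical processor (0 or 1) gets the trace cache this cycle?
 * Returns -1 if neither wants it.  If only one thread can use the TC it gets
 * the full bandwidth; if both can, access alternates cycle by cycle. */
static int tc_arbitrate(uint64_t cycle, bool lp0_wants, bool lp1_wants)
{
    if (lp0_wants && lp1_wants)
        return (int)(cycle & 1);   /* alternate between the two logical processors */
    if (lp0_wants)
        return 0;
    if (lp1_wants)
        return 1;
    return -1;
}
```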

Intel Pentium® 4 processor 34

BPU and IFU

Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated: it is a very small structure, and call/return pairs are better predicted per thread
– Global history is tracked for each logical processor
– The large global history array is shared, with entries tagged with a logical processor ID

Each logical processor has its own ITLB

The logic in charge of sending fetch requests to the L2 cache
– Arbitrates on a first-come first-served basis
– Always reserves at least one request slot for each logical processor
– Both logical processors can have fetches pending simultaneously

Two 64-byte streaming buffers per logical processor
– Hold instruction bytes in preparation for instruction decode
– Small structures ⇒ low die-size cost of duplicating

Intel Pentium® 4 processor 35

Decoder and Uop Queue

Both logical processors share the same decoder logic
– If only one needs the decode logic, it gets the full decode bandwidth
– The state needed by the decoder is duplicated
– Several instructions are decoded for one logical processor before switching to the other

Decoded instructions are written into the TC and forwarded to the uop queue

The uop queue is hard partitioned
– Allows both logical processors to make independent forward progress regardless of FE stalls (e.g., TC miss) or EXE stalls

Intel Pentium® 4 processor 36

Out-of-order Execution Engine
– Allocation, register renaming, scheduling, execution

Intel Pentium® 4 processor 37

Allocator and RAT

The ROB and MOB are hard partitioned
– Enforces fairness and prevents deadlocks

Allocator
– If there are uops for both logical processors in the uop queue, it alternates between the logical processors every cycle to assign resources
– If a logical processor has reached the limit in one of the buffers, it continues to assign resources to the other logical processor
– If the uop queue only contains uops for one logical processor, it assigns resources to that logical processor every cycle; the resource limits are still enforced

Register renaming
– There are two RATs, one for each logical processor
– Renaming is done in parallel to the allocator logic, for the same uops to which the allocator is assigning resources
– Registers are renamed to a shared physical register pool

Intel Pentium® 4 processor 38

Instruction Scheduling

After allocation and renaming, uops are placed in one of 2 queues
– Memory instruction queue and general instruction queue
– The two queues are hard partitioned
– They send uops to the scheduler queues as fast as they can, alternating between the two logical processors every clock cycle

The schedulers choose uops regardless of their logical processor
– The schedulers are oblivious to the logical processor
– Uops are evaluated only based on dependent inputs and availability of execution resources
– Uops from the two logical processors can be dispatched in the same cycle

To avoid deadlock and ensure fairness
– There is a limit on the number of active entries that a logical processor can have in each scheduler’s queue
– This limit depends on the size of the scheduler queue

Intel Pentium® 4 processor 39

Execution Units

The execution core and memory hierarchy are also largely oblivious to logical processors

Registers are renamed to a shared physical register pool
– Uops merely access the physical register file to get their destinations, and write results back to the physical register file

Forwarding logic compares physical register numbers
– Results are forwarded to other uops without any knowledge of logical processors

Intel Pentium® 4 processor 40

Retirement

Retirement logic tracks when uops from the two logical processors are ready to be retired
– Retires uops in program order for each logical processor, alternating between the two logical processors every cycle

If one logical processor is not ready to retire uops
– All retirement bandwidth is dedicated to the other logical processor

Once stores have retired, the store data is written into the L1 D$
– Selection logic alternates between the two logical processors to commit store data to the cache

Intel Pentium® 4 processor 41

Memory Subsystem

The memory subsystem is largely oblivious to logical processors
– Schedulers send load or store uops regardless of logical processor
– The memory subsystem handles them as they come

DTLB
– The DTLB translates addresses to physical addresses
– It has 64 fully associative entries; each entry can map either a 4K or a 4MB page
– Although the DTLB is a structure shared between the two logical processors, each entry includes a logical processor ID tag
– Each logical processor also has a reservation register to ensure fairness and forward progress in processing DTLB misses

L1 Data Cache, L2 Cache, L3 Cache
– Both logical processors share all entries in all 3 cache levels, regardless of which logical processor’s uops initially brought the data into the cache
– Potential for cache conflicts, which can result in lower performance
– There is also the possibility of sharing data in the cache: one logical processor may prefetch instructions or data needed by the other into the cache (common in server application code in a producer-consumer model), or one logical processor may produce data that the other logical processor uses

Intel Pentium® 4 processor 42

Single-task And Multi-task Modes

MT-mode (Multi-task mode)
– Two active logical processors
– Some of the resources are partitioned as described earlier

ST-mode (Single-task mode)
– Optimizes performance for the case of a single thread
– There are two flavors of ST-mode: single-task logical processor 0 (ST0), where only logical processor 0 is active, and single-task logical processor 1 (ST1), where only logical processor 1 is active
– Resources that were partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources

The HALT instruction
– Stops processor execution, and allows the processor to go into a lower-power mode
– HALT is a privileged instruction: only the OS or other ring-0 processes may execute it

Moving the processor from MT mode to ST0/ST1 mode
– Logical processor 0 executes HALT ⇒ move to ST1-mode
– Logical processor 1 executes HALT ⇒ move to ST0-mode
– If the remaining active logical processor also executes HALT, the physical processor can then go to a lower-power mode

Moving the processor from ST0/ST1 mode to MT mode
– Send an interrupt to the HALTed logical processor
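The mode transitions above can be written as a small C state machine (the encoding is illustrative, and the wake-up path out of the low-power state is an assumption about interrupt delivery, not something the slide spells out):

```c
typedef enum { MODE_MT, MODE_ST0, MODE_ST1, MODE_LOW_POWER } ht_mode_t;

/* Logical processor `lp` (0 or 1) executes HALT. */
static ht_mode_t on_halt(ht_mode_t m, int lp)
{
    if (m == MODE_MT)
        return lp == 0 ? MODE_ST1 : MODE_ST0;   /* the other thread gets all resources */
    if ((m == MODE_ST0 && lp == 0) || (m == MODE_ST1 && lp == 1))
        return MODE_LOW_POWER;                  /* both halted: drop to lower power */
    return m;                                   /* HALT by an already-halted thread */
}

/* An interrupt is delivered to logical processor `lp`. */
static ht_mode_t on_interrupt(ht_mode_t m, int lp)
{
    if (m == MODE_LOW_POWER)
        return lp == 0 ? MODE_ST0 : MODE_ST1;   /* assumed: wake only the target thread */
    if ((m == MODE_ST0 && lp == 1) || (m == MODE_ST1 && lp == 0))
        return MODE_MT;                         /* the HALTed thread becomes active again */
    return m;                                   /* interrupt to an already-running thread */
}
```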

Intel Pentium® 4 processor 43

Operating System And Applications

An HT processor appears to the OS and application SW as 2 processors
– The OS manages logical processors as it does physical processors, scheduling runnable tasks or threads to logical processors

The OS should implement two optimizations:

Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do
– This so-called “idle loop” can consume significant execution resources that could otherwise be used by the other active logical processor

Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
– Allows software threads to use different physical execution resources when possible

Intel Pentium® 4 processor 44

HT in Prescott

Many of Prescott’s enhancements were motivated by HT performance (they have little impact on the majority of single-threaded apps)
– Increase the number of outstanding loads that miss the L1 D-cache from 4 to 8
– Increase the size of the queue between the front end and the Alloc/RAT
– Work on a page table walk and handle a memory access that splits a cache line in parallel (rarely seen with a single thread, much more common with two threads)
– Allow starting new page table walks while handling a page table walk that missed all the caches and had to go to DRAM

Changes were also made to some of the thread selection points in order to improve overall bandwidth
– The trace cache responds faster to stalling events in the core, dedicating all of its resources to the thread that is not stalled, thereby generating better overall performance


Intel Pentium® 4 processor 45

Backup

Intel Pentium® 4 processor 46

Schedulers – Prescott

Increased the size of the two schedulers used to hold x87/SSE/SSE2/SSE3 uops
– Increases the window of opportunity to find parallelism in multimedia algorithms

Increased the effective size of the queues that feed all the schedulers
– More uops can be buffered between the allocator and the scheduler before the allocator has to stall
– Allows the Allocator and Renamer to continue to look ahead in the instruction stream even when the schedulers are full

Intel Pentium® 4 processor 47

XOR reg, reg

XOR of a register with itself is used to zero the register
– Rather than moving an immediate 0 into the register
– The result is logically equivalent
– Preferred because of the smaller resulting code size
– But it adds a dependency on the previous contents of the register, which can result in a performance loss

The hardware recognizes when XOR is used in this manner
– Removes the dependency on the source register: the result (0) does not depend on the value of the register
– Prescott recognizes this for more XOR and SUB types
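The two zeroing idioms, written as GCC/Clang-style inline assembly in C for x86 (illustration only): the xor form has the shorter encoding, and the renamer treats it as having no dependence on the register's previous contents.

```c
#include <stdint.h>

/* Zero via "xor reg, reg": the result never depends on the old value, and the
 * P4 recognizes the idiom, so the false dependence on `stale` is dropped. */
static inline uint32_t zero_by_xor(uint32_t stale)
{
    __asm__ ("xorl %0, %0" : "+r"(stale));
    return stale;                        /* always 0 */
}

/* Zero via "mov reg, 0": logically equivalent, but a longer encoding. */
static inline uint32_t zero_by_mov(void)
{
    uint32_t r;
    __asm__ ("movl $0, %0" : "=r"(r));
    return r;
}
```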

Intel Pentium® 4 processor 48

Memory System

Write-combining buffers track streams of stores
– Increased from 6 to 8
– Alleviates pressure on the number of stores that can be in the machine simultaneously by allowing stores to be processed faster

The number of unique outstanding loads that have missed the L1 data cache and can be serviced has been increased from 4 to 8

Increased the size of the L2 cache from 256/512 Kbytes to 1MB
– 8-way set associative, 128-byte lines, write-back cache
– Physically addressed
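The stated L2 geometry fixes the index arithmetic; a few lines of C make it explicit (standard cache arithmetic, not taken from the paper):

```c
#include <stdio.h>

int main(void)
{
    unsigned size_bytes = 1u << 20;                      /* 1 MB L2 */
    unsigned ways       = 8;                             /* 8-way set associative */
    unsigned line_bytes = 128;                           /* 128-byte lines */

    unsigned sets = size_bytes / (ways * line_bytes);    /* = 1024 sets */
    printf("sets=%u, offset bits=7, index bits=10\n", sets);
    return 0;
}
```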

Intel Pentium® 4 processor 49

Memory System

Added a performance feature to enhance SSE/SSE2/SSE3 instructions
– On the x87 side, the FP Control Word (FCW) is often modified when changing the rounding mode and precision
– To avoid serializing the processor each time the FCW is modified, a simple predictor captures common renaming cases
– This same idea is now extended to also handle the MXCSR, which is the corollary of the FCW for instructions that use the SSE registers
– On prior implementations, changes to the MXCSR would serialize the machine
– Prescott also avoids serialization for common-case modifications of the MXCSR


Intel Pentium® 4 processor 50

Pentium® 4 Block Diagram


Intel Pentium® 4 processor 51

90nm Pentium® 4 Block Diagram

Intel Pentium® 4 processor 52

Force Forwarding

Allows the MOB to control forwarding in the SFB

Two new selection points added to the store-forwarding path
– The forwarding-entry-selection mux: allows the MOB to override the SFB’s partial-address-match-based entry selection
– The data alignment mux: allows misaligned data to be rotated, based on the shift information provided by the MOB

The MOB can decide to re-execute a load in 2 cases
– A partial address match caused an incorrect dependency in the SFB
– The SFB failed to detect a dependency when a dependency did exist

If the SFB’s dependency check is wrong, the MOB can correct the forwarding logic when the load re-executes by directing the SFB in one of two ways:
– Forward to the load from the right entry and rotate the data as necessary, or
– Disable forwarding to the load if there is no dependent store in the SFB

Intel Pentium® 4 processor 53

Force forwarding

[Figure: the MOB sits beside the Store Forwarding Buffer (SFB); forwarding-entry selection is normally based on a partial virtual address match, but the MOB can supply a forced forwarding entry to the forwarding-entry mux and shift control to the alignment mux, which rotates the data to produce the load’s forwarded data]

Intel Pentium® 4 processor 54

Force forwarding

Supported cases of misaligned forwarding: misaligned address cases fixed by force forwarding
– For each load at a given starting address, force forwarding is supported only for certain data access sizes

These cases can be categorized as follows:
– DWord/QWord store forwarding to Byte/Word loads whose data are fully contained in either the lower or the upper DWord
– QWord store forwarding to a DWord load of the upper DWord of the store

For each of these cases, the MOB “forces” the SFB to forward from a specific store by a given shift amount in order to align the store’s data to the load
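A C sketch of the containment-and-shift rule (an approximation: the exact table of supported load sizes and start addresses is not reproduced in the text, so this only checks that the load is fully contained in the store and computes the byte shift):

```c
#include <stdbool.h>
#include <stdint.h>

/* Approximate force-forwarding check: a load whose bytes are fully contained
 * in a pending DWord/QWord store can be served from the SFB if its data is
 * shifted by the load's byte offset within the store. */
static bool force_forward_shift(uint64_t st_addr, uint32_t st_size,
                                uint64_t ld_addr, uint32_t ld_size,
                                uint32_t *shift_bytes)
{
    bool contained = ld_addr >= st_addr &&
                     ld_addr + ld_size <= st_addr + st_size;
    if (!contained)
        return false;

    /* e.g. a DWord load of the upper DWord of a QWord store -> shift of 4 */
    *shift_bytes = (uint32_t)(ld_addr - st_addr);
    return true;
}
```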

Intel Pentium® 4 processor 55

Force forwarding

False forwarding occurs when
– The SFB detects a partial address match between a load and a store, but
– Their full addresses do not match

The MOB detects the false-forward condition
– Determines whether there exists another store that the load should have forwarded from
– If a store exists that can be forwarded, the MOB directs the SFB to forward from that store entry, using the force-forwarding mechanism, when the load re-executes
– If the MOB detects that there is no dependent store in the forwarding buffer, it instructs the SFB not to forward to this load; when the load is re-executed, it picks up its data from the cache instead

Intel Pentium® 4 processor 56

Bus

Requests from the logical processors are treated on a first-come basis
– Queue and buffering space appear shared
– Priority is not given to one logical processor over the other

Bus queues distinguish between requests from the two logical processors
– Requests to the local APIC and interrupt delivery resources are unique and separate per logical processor

Bus logic also carries out portions of barrier fence and memory ordering operations
– These are applied to the bus request queues on a per-logical-processor basis

The logical processor ID is visibly sent onto the processor external bus in the request phase portion of a transaction
– For debug purposes
– As an aid to forward-progress mechanisms in clustered multiprocessor implementations
– Other bus transactions, such as cache line eviction or prefetch transactions, inherit the logical processor ID of the request that generated the transaction

Intel Pentium® 4 processor 57

HT – Context Identifier Bit

The partial virtual address indexing used for the L1 cache creates conflicts
– When each logical processor’s access pattern matches the partial virtual tag even when accessing separate regions of physical memory
– E.g., the stacks of the 2 threads are offset by a fixed amount that is greater than the size of the partial match, so that these 2 addresses, although different, alias to the same partial tag
– This causes contention in the cache, leading to a reduced cache hit rate

A context identifier bit is added to the partial virtual tag
– Aimed at reducing the likelihood of contention in the cache
– Dynamically set or reset based on the page-table initialization for each logical processor
– Serves as an indication of data-sharing intent across logical processors

If the two logical processors share the same page directory base in physical memory
– This gives a strong indication that data are intended to be shared between the logical processors
– The context-identifier bit for each logical processor is set to the same value, allowing sharing of the L1 data cache

If the page-directory bases are different
– It is likely that both logical processors are working on separate data regions
– Sharing of the L1 data cache is disallowed by keeping the context-identifier bit different across the logical processors

There may be uncommon cases where logical processors use different page directory bases but still share the same physical memory region through page-table aliasing
– These arise when two different page table entries across logical processors point to the same physical page frame
– The processor detects such cases and implements a reservation mechanism to prevent repetitive L1 cache access conflicts among the logical processors
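Finally, a C sketch of how the context-identifier bit takes part in the partial virtual tag compare (the tag width and layout are invented for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t partial_vtag;   /* partial virtual tag bits */
    uint8_t  ctx_id;         /* context identifier bit (0 or 1) */
} l1_tag_t;

/* The context-id bit is made equal for both logical processors when they share
 * a page-directory base (sharing intended) and different otherwise, so two
 * threads whose partial tags alias no longer hit each other's lines. */
static bool l1_partial_hit(l1_tag_t entry, uint32_t access_partial_vtag, uint8_t access_ctx_id)
{
    return entry.partial_vtag == access_partial_vtag
        && entry.ctx_id == access_ctx_id;
}
```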