EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12



Optimizing CPU Performance

• Golden Rule: tCPU = Ninst × CPI × tCLK

• Given this, what are our options?
  – Reduce the number of instructions executed
  – Reduce the cycles to execute an instruction
  – Reduce the clock period

• Our next focus: further reducing CPI
  – Approach: superscalar execution
  – Capable of initiating multiple instructions per cycle
  – Possible to implement for in-order or out-of-order pipelines
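The trade-off above can be made concrete with a quick calculation; a minimal sketch (the function name is illustrative):

```python
# Hypothetical illustration of the golden rule: tCPU = Ninst * CPI * tCLK.
def cpu_time(n_inst, cpi, t_clk):
    """Total execution time in seconds."""
    return n_inst * cpi * t_clk

# 1e9 instructions at CPI = 1.0 on a 1 GHz clock (1 ns period) -> 1 second.
baseline = cpu_time(1e9, 1.0, 1e-9)
# If superscalar execution halves CPI while tCLK holds steady, tCPU halves too.
superscalar = cpu_time(1e9, 0.5, 1e-9)
```

Note that any increase in tCLK from the added complexity eats directly into the CPI gain.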

Why Superscalar?

[Figure: pipelined execution vs. superscalar + pipelined execution]

• Optimization results in more complexity
  – Longer wires, more logic → higher tCLK and tCPU
  – Architects must strike a balance with reductions in CPI

Implications of Superscalar Execution

• Instruction fetch?
  – Taken branches, multiple branches, partial cache lines

• Instruction decode?
  – Simple for a fixed-length ISA, much harder for variable length

• Renaming?
  – Multi-ported rename table; inter-instruction dependencies must be recognized

• Dynamic scheduling?
  – Requires multiple result buses, smarter selection logic

• Execution?
  – Multiple functional units, multiple result buses

• Commit?
  – Multiple ROB/ARF ports; dependencies must be recognized

P4 Overview

• Latest IA-32 processor from Intel
  – Equipped with the full set of IA-32 SIMD operations
  – First flagship architecture since the P6 microarchitecture
  – Pentium 4 ISA = Pentium III ISA + SSE2
  – SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations, plus prefetch

Comparison Between Pentium III and Pentium 4

Execution Pipeline

Front End

• Predicts branches

• Fetches/decodes instructions into the trace cache

• Generates µops for complex instructions

• Prefetches instructions that are likely to be executed

Branch Prediction

• Dynamically predicts the direction and target of branches based on the PC, using the BTB

• If no dynamic prediction is available, statically predict
  – Taken for backward (looping) branches
  – Not taken for forward branches
  – Implemented at decode

• Traces built across (predicted) taken branches to avoid taken branch penalties

• Also includes a 16-entry return address stack predictor
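The static rule above (backward taken, forward not taken) can be sketched as follows; the function name and addresses are illustrative, not Intel's:

```python
# Decode-stage static prediction when the BTB has no entry for a branch:
# backward branches (target below the branch PC) are assumed to be loop
# back-edges and predicted taken; forward branches are predicted not taken.
def static_predict(branch_pc, target_pc):
    """Return True if the branch is statically predicted taken."""
    return target_pc < branch_pc
```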

Decoder

• Single decoder available
  – Operates at a maximum of 1 instruction per cycle

• Receives instructions from the L2 cache 64 bits at a time

• Some complex instructions must enlist the micro-ROM
  – Used for very complex IA-32 instructions (> 4 µops)
  – After the microcode ROM finishes, the front end resumes fetching µops from the trace cache


Trace Cache

• Primary instruction cache in the P4 architecture
  – Stores 12K decoded µops

• On a miss, instructions are fetched from L2

• Trace predictor connects traces

• Trace cache removes
  – Decode latency after mispredictions
  – Decode power for all pre-decoded instructions

Branch Hints

• P4 software can provide hints to branch prediction and the trace cache
  – Specify the likely direction of a branch
  – Implemented with conditional branch prefixes
  – Used for decode-stage predictions and trace building


Execution

• 126 µops can be in flight at once
  – Up to 48 loads / 24 stores

• Can dispatch up to 6 µops per cycle
  – 2× the trace cache and retirement µop bandwidth
  – Provides additional bandwidth for scheduling around mispeculation

Execution Units

Register Renaming


• 8-entry architectural register file

• 128-entry physical register file

• 2 RATs (a front-end RAT and a retirement RAT)

• The retirement RAT eliminates register writes into the ARF
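A minimal sketch of front-end renaming through a RAT, assuming the 8 architectural and 128 physical registers above; free-list recycling and the retirement RAT are omitted:

```python
# Map 8 architectural registers onto 128 physical registers via a RAT.
ARCH_REGS, PHYS_REGS = 8, 128
rat = {r: r for r in range(ARCH_REGS)}         # arch reg -> current phys reg
free_list = list(range(ARCH_REGS, PHYS_REGS))  # unallocated physical regs

def rename(dest, srcs):
    """Rename one instruction: resolve sources, allocate a fresh destination."""
    phys_srcs = [rat[s] for s in srcs]  # true dependencies flow through the RAT
    phys_dest = free_list.pop(0)        # fresh register removes WAW/WAR hazards
    rat[dest] = phys_dest
    return phys_dest, phys_srcs
```

Because each destination gets a fresh physical register, only true (read-after-write) dependencies remain visible to the scheduler.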

Store and Load Scheduling

• Out-of-order store and load operations
  – Stores are always performed in program order

• Up to 48 loads and 24 stores can be in flight

• Store/load buffers are allocated at the allocation stage
  – 24 store buffers and 48 load buffers in total



Retirement

• Can retire 3 µops per cycle

• Implements precise exceptions

• Reorder buffer is used to organize completed µops

• Also keeps track of branches and sends updated branch information to the BTB

Data Stream of Pentium 4 Processor

On-chip Caches

• L1 instruction cache (trace cache)
• L1 data cache
• L2 unified cache
  – All caches use a pseudo-LRU replacement algorithm

• Parameters:
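Tree pseudo-LRU for a single 4-way set can be sketched as below; the P4's exact bit layout is not documented here, so this encoding is an assumption:

```python
# Three bits form a binary tree over 4 ways: bits[0] records which pair was
# used last, bits[1]/bits[2] record the last-used way within each pair.
bits = [0, 0, 0]

def touch(way):
    """Mark the path to `way` as most recently used."""
    bits[0] = 0 if way < 2 else 1
    if way < 2:
        bits[1] = way & 1
    else:
        bits[2] = way & 1

def victim():
    """Walk away from the MRU path to pick an approximate-LRU way."""
    if bits[0] == 0:              # left pair used last -> evict on the right
        return 2 + (1 - bits[2])
    return 1 - bits[1]            # right pair used last -> evict on the left
```

Three bits per set instead of the bookkeeping needed for true LRU is why pseudo-LRU is the common hardware choice.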

L1 Data Cache

• Non-blocking
  – Supports up to 4 outstanding load misses

• Load latency
  – 2 clocks for integer
  – 6 clocks for floating point

• 1 load and 1 store per clock

• Load speculation
  – Assume the access will hit the cache
  – “Replay” the dependent instructions when a miss is detected
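The hit-speculation scheme above can be sketched as follows; this is an illustrative model, not the P4's actual replay machinery:

```python
# Dependents are scheduled assuming the 2-cycle integer load-hit latency.
# If the load actually misses, every speculatively issued dependent µop
# must be replayed once the data arrives.
def resolve_load(cache_hit, dependents):
    """Return the µops that must be replayed for this load."""
    return [] if cache_hit else list(dependents)
```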

L2 Cache

• Non-blocking

• Load latency
  – Net load access latency of 7 cycles

• Bandwidth
  – 1 load and 1 store in one cycle
  – New cache operations may begin every 2 cycles
  – 256-bit wide bus between L1 and L2
  – 48 GB/s @ 1.5 GHz
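The quoted bandwidth follows directly from the bus parameters above:

```python
# A 256-bit (32-byte) L1<->L2 bus transferring every core clock at 1.5 GHz:
bus_bytes = 256 // 8              # 32 bytes per transfer
clock_hz = 1.5e9                  # 1.5 GHz core clock
bandwidth = bus_bytes * clock_hz  # 48e9 bytes/s = 48 GB/s
```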

L2 Cache Data Prefetcher

• Hardware prefetcher monitors the reference patterns

• Brings cache lines in automatically

• Attempts to fetch 256 bytes ahead of current access

• Prefetch for up to 8 simultaneous independent streams
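A stride-based stream prefetcher in the spirit described above might look like this sketch; the class name and the confirmation policy are assumptions:

```python
# One stream's worth of prefetcher state: detect a repeated constant stride
# in the observed load addresses, then issue a prefetch well ahead of the
# current access (256 bytes ahead, matching the figure quoted above).
AHEAD = 256

class StreamPrefetcher:
    def __init__(self):
        self.last_addr = None
        self.stride = None

    def access(self, addr):
        """Observe a load address; return a prefetch address or None."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride and stride != 0:
                prefetch = addr + AHEAD  # confirmed stream: run ahead
            self.stride = stride
        self.last_addr = addr
        return prefetch
```

A real implementation would track up to 8 such streams concurrently, one table entry per stream.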

System Bus

• Delivers data at 3.2 GB/s

• 64-bit wide bus

• Four data phases per clock cycle (“quad pumped”)

• 100 MHz system bus clock
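The 3.2 GB/s figure is just the product of the bus parameters:

```python
# 64-bit bus, four transfers per bus clock ("quad pumped"), 100 MHz clock:
bus_bytes = 64 // 8                   # 8 bytes per transfer
transfers_per_clock = 4               # quad pumped
bus_clock_hz = 100e6                  # 100 MHz
bandwidth = bus_bytes * transfers_per_clock * bus_clock_hz  # 3.2 GB/s
```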

Execution on MPEG4 Benchmarks @ 1 GHz

Performance Trends

[Chart: performance (SPECInt2000, log scale 0.1–10000) across i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, decomposed into technology (relative FO4 delay), pipelining (relative FO4 gates/stage), and ILP (relative SPECInt/MHz). Actual performance trails the Moore's Law speedup, leaving a performance gap relative to the ~10k SPECInt2000 needed for real-time speech.]

Power Trends

[Chart: power in watts (log scale 0.1–1000) across the same processor generations, broken into total, dynamic, and static power. Actual power far exceeds the ~500 mW budget for real-time speech, with reference power densities marked for a hot plate, a rocket nozzle, and a nuclear reactor.]