
9.1.0 Branch Prediction Pentiums IBM PPC


Page 1: 9.1.0 Branch Prediction Pentiums IBM PPC


ECE – 684

Branch Prediction

http://www.extremetech.com/article2/0,1558,1155321,00.asp PC Processor Microarchitecture, additional references

Page 2: 9.1.0 Branch Prediction Pentiums IBM PPC


There really are three different kinds of branches:

•  Forward conditional branches: based on a run-time condition, the PC (Program Counter) is changed to point to an address forward in the instruction stream.

•  Backward conditional branches: the PC is changed to point backward in the instruction stream. The branch is based on some condition, such as branching backwards to the beginning of a program loop when a test at the end of the loop states that the loop should be executed again.

•  Unconditional branches: this includes jumps, procedure calls, and returns that have no specific condition. For example, an unconditional jump instruction might be coded in assembly language as simply "jmp", and the instruction stream must immediately be directed to the target location pointed to by the jump instruction, whereas a conditional jump coded as "jne" would redirect the instruction stream only if the result of a comparison of two values in a previous "compare" instruction shows the values to not be equal. (The segmented addressing scheme used by the x86 architecture adds extra complexity, since jumps can be either "near" (within a segment) or "far" (outside the segment). Each type has different effects on branch prediction algorithms.)

A Closer Look At Branch Prediction
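As a concrete illustration (hypothetical C, not from the slides): a compiler typically turns a loop's closing test into a backward conditional branch, an if into a forward conditional branch, and calls/returns into unconditional branches:

    /* Hypothetical illustration of where the three branch kinds come from. */
    #include <stdio.h>

    void classify(const int *a, int n) {
        int negatives = 0;
        for (int i = 0; i < n; i++) {    /* loop-closing test: backward conditional branch */
            if (a[i] < 0) {              /* forward conditional branch (skips the body)    */
                negatives++;
            }
        }
        printf("%d negative values\n", negatives);  /* call and return: unconditional branches */
    }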

Page 3: 9.1.0 Branch Prediction Pentiums IBM PPC


Static Branch Prediction always predicts the same direction for the same branch during the whole program execution. It comprises hardware-fixed prediction and compiler-directed prediction. Simple hardware-fixed direction mechanisms can be:
• Predict always not taken
• Predict always taken
• Backward branch predict taken, forward branch predict not taken

Sometimes a bit in the branch opcode allows the compiler to decide the prediction direction. A sketch of the hardware-fixed rules follows below.

Static Branch Prediction
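A minimal sketch of the three hardware-fixed rules as pure functions of the branch address and target (C; names are illustrative, not from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    /* Always-not-taken rule. */
    static bool predict_not_taken(uint32_t pc, uint32_t target) {
        (void)pc; (void)target;
        return false;
    }

    /* Always-taken rule. */
    static bool predict_taken(uint32_t pc, uint32_t target) {
        (void)pc; (void)target;
        return true;
    }

    /* BTFN: backward branch (target below PC) predicted taken,
       forward branch predicted not taken. */
    static bool predict_btfn(uint32_t pc, uint32_t target) {
        return target < pc;
    }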

Page 4: 9.1.0 Branch Prediction Pentiums IBM PPC


Dynamic Branch Prediction: the hardware influences the prediction as execution proceeds. The prediction is based on the computation history of the program. During the start-up phase of the program execution, where a static branch prediction might be effective, the history information is gathered; then dynamic branch prediction becomes effective. In general, dynamic branch prediction gives better results than static branch prediction, but at the cost of increased hardware complexity.

Dynamic Branch Prediction

Page 5: 9.1.0 Branch Prediction Pentiums IBM PPC


Forward branches dominate backward branches by about 4 to 1 (whether conditional or not). About 60% of the forward conditional branches are taken, while approximately 85% of the backward conditional branches are taken (because of the prevalence of program loops). Just knowing this data about average code behavior, we could optimize our architecture for the common cases. A "Static Predictor" can just look at the offset (distance forward or backward from current PC) for conditional branches as soon as the instruction is decoded. Backward branches will be predicted to be taken, since that is the most common case. The accuracy of the static predictor will depend on the type of code being executed, as well as the coding style used by the programmer. These statistics were derived from the SPEC suite of benchmarks, and many PC software workloads will favor slightly different static behavior.

Using Branch Statistics for Static Prediction

Page 6: 9.1.0 Branch Prediction Pentiums IBM PPC

Static Profile-Based Compiler Branch Misprediction Rates for SPEC92

Floating point: average 9% misprediction (i.e., 91% prediction accuracy); FP code has more loops.
Integer: average 15% misprediction (i.e., 85% prediction accuracy).

Page 7: 9.1.0 Branch Prediction Pentiums IBM PPC


•  Dynamic branch prediction schemes are different from static mechanisms because they utilize hardware-based mechanisms that use the run-time behavior of branches to make more accurate predictions than is possible with static prediction.

•  Usually, information about the outcomes of previous occurrences of branches (branch history) is used to dynamically predict the outcome of the current branch. Some of the proposed dynamic branch prediction mechanisms include:

–  One-level or Bimodal: Uses a Branch History Table (BHT), a table of (usually two-bit) saturating counters indexed by a portion of the branch address (low bits of the address). (First proposed mid-1980s.)
–  Two-Level Adaptive Branch Prediction. (First proposed early 1990s.)
–  McFarling's Two-Level Prediction with index sharing (gshare, 1993).
–  Hybrid or Tournament Predictors: Use a combination of two or more (usually two) branch prediction mechanisms (1993).

•  To reduce the stall cycles resulting from correctly predicted taken branches to zero cycles, a Branch Target Buffer (BTB), which holds the addresses of conditional branches that were taken along with their targets, is added to the fetch stage.

Dynamic Conditional Branch Prediction

Page 8: 9.1.0 Branch Prediction Pentiums IBM PPC

How to further reduce the impact of branches on pipelined processor performance

Dynamic Branch Prediction:

Hardware-based schemes that utilize run-time behavior of branches to make dynamic predictions:

Information about the outcomes of previous occurrences of branches is used to dynamically predict the outcome of the current branch. Why? Better branch prediction accuracy and thus fewer branch stalls.

Branch Target Buffer (BTB): A hardware mechanism that aims at reducing the stall cycles resulting from correctly predicted taken branches to zero cycles.

Page 9: 9.1.0 Branch Prediction Pentiums IBM PPC


To refine our branch prediction, we could create a buffer that is indexed by the low-order address bits of recent branch instructions. In this BHB (sometimes called a "Branch History Table (BHT)"), for each branch instruction, we'd store a bit that indicates whether the branch was recently taken. A simple way to implement a dynamic branch predictor would be to check the BHB for every branch instruction. If the BHB's prediction bit indicates the branch should be taken, then the pipeline can go ahead and start fetching instructions from the new address (once it computes the target address). By the time the branch instruction works its way down the pipeline and actually causes a branch, then the correct instructions are already in the pipeline. If the BHB was wrong, a "misprediction" occurred, and we'll have to flush out the incorrectly fetched instructions and invert the BHB prediction bit.

Dynamic Branch Prediction with a Branch History Buffer (BHB)
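As a minimal sketch of the scheme just described (table size and indexing details are illustrative assumptions), in C:

    #include <stdbool.h>
    #include <stdint.h>

    #define BHB_BITS 12
    #define BHB_SIZE (1u << BHB_BITS)       /* 4096 one-bit entries */

    static bool bhb[BHB_SIZE];              /* true = branch was recently taken */

    static unsigned bhb_index(uint32_t pc) {
        return (pc >> 2) & (BHB_SIZE - 1);  /* drop byte offset, keep low address bits */
    }

    bool bhb_predict(uint32_t pc) {
        return bhb[bhb_index(pc)];
    }

    void bhb_update(uint32_t pc, bool taken) {
        /* On a misprediction the stored bit simply flips to the actual outcome. */
        bhb[bhb_index(pc)] = taken;
    }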

Page 10: 9.1.0 Branch Prediction Pentiums IBM PPC

[Figure: Dynamic Branch Prediction with a Branch History Buffer (BHB)]

Page 11: 9.1.0 Branch Prediction Pentiums IBM PPC


It turns out that a single bit in the BHB will be wrong twice for a loop: once on the first pass of the loop and once at the end of the loop. We can get better prediction accuracy by using more bits to create a "saturating counter" that is incremented on a taken branch and decremented on an untaken branch. It turns out that a 2-bit predictor does about as well as you could get with more bits, achieving anywhere from 82% to 99% prediction accuracy with a table of 4096 entries. This table size is at the point of diminishing returns for 2-bit entries, so there isn't much point in storing more. Since we're only indexing by the lower address bits, notice that 2 different branch addresses might have the same low-order bits and could point to the same place in our table: one reason not to let the table get too small.

Refining Our BHB by Storing More Bits
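A sketch of the 2-bit refinement, matching the 4096-entry figure in the text (indexing details are assumptions): the counter saturates at 0 and 3, and its high bit gives the prediction:

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_BITS 12
    #define BHT_SIZE (1u << BHT_BITS)           /* 4096 entries, 2 bits each */

    static uint8_t bht[BHT_SIZE];               /* counters in 0..3; >= 2 means predict taken */

    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_SIZE - 1);
    }

    bool bht_predict(uint32_t pc) {
        return bht[bht_index(pc)] >= 2;         /* high bit of the 2-bit counter */
    }

    void bht_update(uint32_t pc, bool taken) {
        uint8_t *c = &bht[bht_index(pc)];
        if (taken) { if (*c < 3) (*c)++; }      /* saturate at 3 (strongly taken)     */
        else       { if (*c > 0) (*c)--; }      /* saturate at 0 (strongly not taken) */
    }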

Page 12: 9.1.0 Branch Prediction Pentiums IBM PPC


There is a further refinement we can make to our BHB by correlating the behavior of other branches. Often called a "Global History Counter", this "two-level predictor" allows the behavior of other branches to also update the predictor bits for a particular branch instruction and achieve slightly better overall prediction accuracy. One implementation is called the "GShare algorithm". This approach uses a "Global Branch History Register" (a register that stores the global result of recent branches) that gets "hashed" with bits from the address of the branch being predicted. The resulting value is used as an index into the BHB where the prediction entry at that location is used to dynamically predict the branch direction. Yes, this is complicated stuff, but it's being used in several modern processors.

Two-Level Predictors and the GShare Algorithm
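Sticking with C for consistency, here is a minimal gshare sketch (12 bits of global history and a 4096-entry table are assumptions, not something the text specifies): the global history register is XORed with low branch-address bits to index the table, and history shifts on every resolved branch:

    #include <stdbool.h>
    #include <stdint.h>

    #define GSHARE_BITS 12
    #define GSHARE_SIZE (1u << GSHARE_BITS)

    static uint8_t  pht[GSHARE_SIZE];           /* 2-bit saturating counters */
    static uint16_t ghr;                        /* global branch history register */

    static unsigned gshare_index(uint32_t pc) {
        /* The "hash" is a bitwise XOR of global history with low address bits. */
        return ((pc >> 2) ^ ghr) & (GSHARE_SIZE - 1);
    }

    bool gshare_predict(uint32_t pc) {
        return pht[gshare_index(pc)] >= 2;
    }

    void gshare_update(uint32_t pc, bool taken) {
        uint8_t *c = &pht[gshare_index(pc)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
        ghr = (uint16_t)(((ghr << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1));
    }

(A real pipeline would reuse the index computed at prediction time, since the history register changes between prediction and update; this sketch recomputes it for brevity.)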

Page 13: 9.1.0 Branch Prediction Pentiums IBM PPC


Combined branch prediction*

Scott McFarling proposed combined branch prediction in his 1993 paper. Combined branch prediction is about as accurate as local prediction, and almost as fast as global prediction.

Combined branch prediction uses three predictors in parallel: bimodal, gshare, and a bimodal-like predictor to pick which of bimodal or gshare to use on a branch-by-branch basis. The choice predictor is yet another 2-bit up/down saturating counter, in this case with the MSB choosing the prediction to use. The counter is updated whenever the bimodal and gshare predictions disagree, to favor whichever predictor was actually right. On the SPEC'89 benchmarks, such a predictor is about as good as the local predictor.

Another way of combining branch predictors is to have, e.g., three different branch predictors and merge their results by a majority vote.

Predictors like gshare use multiple table entries to track the behavior of any particular branch. This multiplication of entries makes it much more likely that two branches will map to the same table entry (a situation called aliasing), which in turn makes it much more likely that prediction accuracy will suffer for those branches. Once you have multiple predictors, it is beneficial to arrange that each predictor has different aliasing patterns, so that it is more likely that at least one predictor will have no aliasing. Combined predictors with different indexing functions for the different predictors are called gskew predictors, and are analogous to skewed associative caches used for data and instruction caching.

* From : http://en.wikipedia.org/wiki/Branch_prediction

Page 14: 9.1.0 Branch Prediction Pentiums IBM PPC


In addition to a large BHB, most predictors also include a buffer that stores the actual target address of taken branches (along with optional prediction bits). This table allows the CPU to see whether an instruction is a branch and start fetching at the target address early in the pipeline. Because it stores both the instruction address and the target address, the CPU can know an instruction is a branch even before it decodes it. A large BTB can completely remove most branch penalties (for correctly predicted branches) if the CPU looks far enough ahead to make sure the target instructions are prefetched.

Using a Return Address Buffer to predict the return from a subroutine

One technique for dealing with the unconditional branch at the end of a subroutine is to keep a buffer of the most recent return addresses. There are usually some subroutines that get called quite often in a program, and a return address buffer can make sure that the correct instructions are in the pipeline after the return instruction.

Using a Branch Target Buffer (BTB) to Further Reduce the Branch Penalty
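A minimal sketch of such a return address buffer (the depth and the wrap-around policy are assumptions): calls push the address after the call, returns pop it as the predicted fetch target:

    #include <stdint.h>

    #define RAS_DEPTH 16                        /* illustrative depth */

    static uint32_t ras[RAS_DEPTH];
    static unsigned ras_top;                    /* wraps around, overwriting old entries */

    void ras_on_call(uint32_t return_addr) {    /* address of the instruction after the call */
        ras[ras_top] = return_addr;
        ras_top = (ras_top + 1) % RAS_DEPTH;
    }

    uint32_t ras_predict_return(void) {
        ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
        return ras[ras_top];                    /* predicted fetch target for the return */
    }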

Page 15: 9.1.0 Branch Prediction Pentiums IBM PPC

Branch Target Buffer (BTB)

•  Effective branch prediction requires the target of the branch at an early pipeline stage (i.e., resolve the branch early in the pipeline).

•  One can use additional adders to calculate the target as soon as the branch instruction is decoded. But this means waiting until the ID stage before the target of the branch can be fetched, so taken branches would be fetched with a one-cycle penalty (this was done in the enhanced MIPS pipeline).

•  To avoid this penalty one can use a Branch Target Buffer (BTB). A typical BTB is an associative memory where the addresses of taken branch instructions are stored together with their target addresses.

•  Some designs store n prediction bits as well, implementing a combined BTB and Branch History Table (BHT).

•  Instructions are fetched from the target stored in the BTB when the branch is predicted taken and found in the BTB. After the branch has been resolved, the BTB is updated. If a branch is encountered for the first time, a new entry is created once it is resolved as taken.

•  Branch Target Instruction Cache (BTIC): A variation of the BTB that also caches the code of the branch target instruction in addition to its address. This eliminates the need to fetch the target instruction from the instruction cache or from memory. A sketch of a simple BTB lookup follows this list.
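The sketch referenced above, as a direct-mapped model (real BTBs are associative, as the slide notes; the size, fields, and the not-taken invalidation policy are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_BITS 9
    #define BTB_SIZE (1u << BTB_BITS)           /* 512 entries, direct-mapped for brevity */

    typedef struct {
        bool     valid;
        uint32_t tag;                           /* full branch address used as the tag */
        uint32_t target;                        /* fetch target on a predicted-taken hit */
    } btb_entry_t;

    static btb_entry_t btb[BTB_SIZE];

    /* Returns true on a hit and writes the predicted target. */
    bool btb_lookup(uint32_t pc, uint32_t *target) {
        btb_entry_t *e = &btb[(pc >> 2) & (BTB_SIZE - 1)];
        if (e->valid && e->tag == pc) { *target = e->target; return true; }
        return false;
    }

    /* Called after the branch resolves; taken branches allocate or update an entry. */
    void btb_update(uint32_t pc, bool taken, uint32_t target) {
        btb_entry_t *e = &btb[(pc >> 2) & (BTB_SIZE - 1)];
        if (taken) { e->valid = true; e->tag = pc; e->target = target; }
        else if (e->valid && e->tag == pc) { e->valid = false; }  /* one simple policy */
    }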

Page 16: 9.1.0 Branch Prediction Pentiums IBM PPC

[Figure: BTB organization]

Page 17: 9.1.0 Branch Prediction Pentiums IBM PPC

BTB Flow

[Figure: BTB flow across the Fetch, Decode, and Execute stages, showing where the prediction output is produced.]

Page 18: 9.1.0 Branch Prediction Pentiums IBM PPC

BTB Penalties

Branch Penalty Cycles Using a Branch-Target Buffer (BTB)

Instruction in BTB?   Prediction   Actual branch   Penalty cycles
Yes                   Taken        Taken           0
Yes                   Taken        Not taken       2
No                    Not taken    Taken           2
No                    Not taken    Not taken       0

Base pipeline taken-branch penalty = 1 cycle (i.e., branches resolved in ID).
Assuming one more stall cycle to update the BTB, penalty = 1 + 1 = 2 cycles.

Page 19: 9.1.0 Branch Prediction Pentiums IBM PPC

Dynamic Branch Prediction

•  Simplest method (One-Level):
–  A branch prediction buffer or Branch History Table (BHT) indexed by the low address bits of the branch instruction.
–  Each buffer location (or BHT entry) contains one bit indicating whether the branch was recently taken or not (e.g., 0 = not taken, 1 = taken).
–  Always mispredicts in the first and last loop iterations.

•  To improve prediction accuracy, two-bit prediction is used (the Smith Algorithm):
–  A prediction must miss twice before it is changed.
–  Thus, a branch involved in a loop will be mispredicted only once when encountered the next time, as opposed to twice when one bit is used. This is why 2-bit prediction is preferred.
–  Two-bit prediction is a specific case of an n-bit saturating counter, incremented when the branch is taken and decremented when the branch is not taken.
–  Two-bit prediction counters are used almost universally, based on observations that the performance of two-bit BHT prediction is comparable to that of n-bit predictors.

The counter (predictor) used is updated after the branch is resolved.

[Figure: a one-bit BHT indexed by the N low bits of the branch address; each BHT entry is one bit, 0 = NT = Not Taken, 1 = T = Taken.]

Page 20: 9.1.0 Branch Prediction Pentiums IBM PPC

One-Level (Bimodal) Branch Predictors

•  One-level or bimodal branch prediction uses only one level of branch history.
•  These mechanisms usually employ a table indexed by the lower N bits of the branch address.
•  Each table entry (or predictor) consists of n history bits, which form an n-bit automaton or saturating counter.
•  Smith proposed such a scheme, known as the Smith Algorithm, that uses a table of two-bit saturating counters (1985).
•  One rarely finds the use of more than 3 history bits in the literature.
•  Two variations of this mechanism:
–  Pattern History Table (PHT): Consists of directly mapped entries.
–  Branch History Table (BHT): Stores the branch address as a tag. It is associative and enables one to identify the branch instruction during IF by comparing the address of an instruction with the stored branch addresses in the table (similar to a BTB).

Page 21: 9.1.0 Branch Prediction Pentiums IBM PPC


The N low bits of the branch address index a table with 2^N entries (also called predictors). Each entry is a 2-bit saturating counter (00, 01, 10, 11); the high bit determines the branch prediction: 0 = NT = Not Taken, 1 = T = Taken.

The counter is updated after the branch is resolved: increment the counter used if the branch is taken, decrement it if the branch is not taken.

Example: For N = 12 the table has 2^N = 2^12 = 4096 = 4K entries. Number of bits needed = 2 x 4K = 8K bits.

This table is sometimes referred to as a Decode History Table (DHT) or Branch History Table (BHT).

What if different branches map to the same predictor (counter)? This is called branch address aliasing; it leads to interference with the current branch's prediction by other branches and may lower branch prediction accuracy for programs with aliasing.

One-Level (Bimodal) Branch Predictors
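To see aliasing concretely, a tiny (hypothetical) example: with N = 12 index bits, two branch addresses that differ only above the indexed bits share one counter:

    #include <stdio.h>
    #include <stdint.h>

    #define N 12
    static unsigned index_of(uint32_t pc) { return (pc >> 2) & ((1u << N) - 1); }

    int main(void) {
        uint32_t branch_a = 0x00401230;     /* two different branch addresses ...      */
        uint32_t branch_b = 0x00405230;     /* ... with identical bits [13:2]          */
        printf("a -> %u, b -> %u\n", index_of(branch_a), index_of(branch_b));
        /* Both print 1164: the two branches alias and interfere with each other. */
        return 0;
    }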

Page 22: 9.1.0 Branch Prediction Pentiums IBM PPC


[Figure: Branch History Table (BHT) indexed by the N low bits of the branch address. Each entry is a 2-bit saturating counter (00, 01, 10, 11); the high bit determines the prediction: 0 = NT = Not Taken, 1 = T = Taken.]

Branch History Table (BHT)

Page 23: 9.1.0 Branch Prediction Pentiums IBM PPC


Basic Dynamic Two-Bit Branch Prediction:

Two-bit saturating counter predictor state transition diagram (Smith Algorithm):

[Figure: the four counter states 11, 10, 01, 00. A Taken (T) outcome moves the counter up toward 11; a Not Taken (NT) outcome moves it down toward 00; the high bit gives the prediction.]

* From: New Algorithm Improves Branch Prediction, Vol. 9, No. 4, March 27, 1995 © 1995 MicroDesign Resources

Page 24: 9.1.0 Branch Prediction Pentiums IBM PPC


Prediction Accuracy of a 4096-Entry Basic One-Level Dynamic Two-Bit Branch Predictor

(N = 12, so the table has 2^N = 4096 entries.)

Misprediction rate: integer average 11%, FP average 4%. FP code has the lower misprediction rate due to more loops; integer code has more branches involved in if-then-else constructs than FP code.

Page 25: 9.1.0 Branch Prediction Pentiums IBM PPC

McFarling's gshare Predictor

•  McFarling noted (1993) that using global history information might be less efficient than simply using the address of the branch instruction, especially for small predictors.

•  He suggested using both global history (BHR) and the branch address by hashing them together. He proposed using the XOR of the global branch history register (BHR) and the branch address, since he expected this value to have more information than either of its components. The result is that this mechanism outperforms the GAp scheme by a small margin.

•  The hardware cost for k history bits is k + 2 x 2^k bits, neglecting the cost of logic.

gshare = global history with index sharing

gshare is one of the most widely implemented two-level dynamic branch prediction schemes.

Page 26: 9.1.0 Branch Prediction Pentiums IBM PPC

gshare Predictor

Branch and pattern history are kept globally. History and branch address are XORed and the result is used to index the pattern history table.

First level: the global Branch History Register (BHR). Second level: one Pattern History Table (PHT) with 2^k entries of 2-bit saturating counters. The BHR is XORed (bitwise) with the N low bits of the branch address (here m = N = k), and the result indexes the second level.

gshare = global history with index sharing

Page 27: 9.1.0 Branch Prediction Pentiums IBM PPC

gshare Performance

[Figure: prediction accuracy of gshare vs. GAp vs. one-level predictors.]

GAp = Global, Adaptive, per-address branch predictor

Page 28: 9.1.0 Branch Prediction Pentiums IBM PPC

Hybrid Predictors (also known as tournament or combined predictors)

•  Hybrid predictors are simply combinations of two or more branch prediction mechanisms.

•  This approach takes into account that different mechanisms may perform best for different branch scenarios.

•  McFarling presented (1993) a number of different combinations of two branch prediction mechanisms.

•  He proposed to use an additional 2-bit counter selector array which serves to select the appropriate predictor for each branch.

•  One predictor is chosen for the higher two counts, the second one for the lower two counts.

•  If the first predictor is wrong and the second one is right, the counter is decremented; if the first one is right and the second one is wrong, the counter is incremented. No change is made if both predictors are right or both are wrong. A sketch of this update rule follows.
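The sketch referenced above (table size is illustrative; the two component predictors are assumed to exist elsewhere):

    #include <stdbool.h>
    #include <stdint.h>

    /* 2-bit selector per entry: counts 2-3 choose the first predictor,
       counts 0-1 choose the second. */
    static uint8_t selector[4096];

    bool choose_first(unsigned idx) {
        return selector[idx] >= 2;
    }

    /* Update after the branch resolves, given each component's correctness. */
    void selector_update(unsigned idx, bool first_correct, bool second_correct) {
        if (first_correct == second_correct) return;  /* both right or both wrong: no change */
        if (first_correct) { if (selector[idx] < 3) selector[idx]++; }  /* favor first  */
        else               { if (selector[idx] > 0) selector[idx]--; }  /* favor second */
    }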

Page 29: 9.1.0 Branch Prediction Pentiums IBM PPC

Intel Pentium 1

•  It uses a single-level 2-bit Smith-algorithm BHT associated with a four-way associative BTB which contains the branch history information.

•  The Pentium does not fetch non-predicted targets and does not employ a return address stack (RAS) for subroutine return addresses.

•  It does not allow multiple branches to be in flight at the same time.

•  Due to the short Pentium pipeline, the misprediction penalty is only three or four cycles, depending on which pipeline the branch takes.

Page 30: 9.1.0 Branch Prediction Pentiums IBM PPC

Intel P6, II, III

•  Like the Pentium, the P6 uses a BTB that retains both branch history information and the predicted target of the branch. However, the BTB of the P6 has 512 entries, reducing BTB misses.

•  The average misprediction penalty is 15 cycles. Misses in the BTB cause a significant 7-cycle penalty if the branch is backward.

•  To improve prediction accuracy, a two-level branch history algorithm is used.

•  Although the P6 has a fairly satisfactory accuracy of about 90%, the enormous misprediction penalty leads to reduced performance. Assuming a branch every 5 instructions and 10% mispredicted branches with 15 cycles per misprediction, the overall penalty resulting from mispredicted branches is 0.3 cycles per instruction. This number may be slightly lower since BTB misses take only seven cycles.
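Working the estimate through: a branch every 5 instructions is 0.2 branches per instruction, so the penalty is 0.2 x 0.10 x 15 = 0.3 cycles per instruction.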

Page 31: 9.1.0 Branch Prediction Pentiums IBM PPC

AMD K6

•  Uses a two-level adaptive branch history algorithm implemented in a BHT (gshare) with 8192 entries (16 times the size of the P6's).

•  However, the size of the BHT prevents AMD from using a BTB or even storing branch target address information in the instruction cache. Instead, the branch target addresses are calculated on the fly, using ALUs, during the decode stage. The adders calculate all possible target addresses before the instructions are fully decoded, and the processor chooses which addresses are valid.

•  A small branch target cache (BTC) is implemented to avoid a one-cycle fetch penalty when a branch is predicted taken.

•  The BTC supplies the first 16 bytes of instructions directly to the instruction buffer.

•  Like the Cyrix 6x86, the K6 employs a return address stack (RAS) for subroutines.

•  The K6 is able to support up to 7 outstanding branches.

•  With a prediction accuracy of more than 95%, the K6 outperformed all other microprocessors when introduced in 1997 (except the Alpha).

Page 32: 9.1.0 Branch Prediction Pentiums IBM PPC

Motorola PowerPC 750

•  A dynamic branch prediction algorithm is combined with static branch prediction which enables or disables the dynamic prediction mode and predicts the outcome of branches when the dynamic mode is disabled.

•  Uses a single-level Smith algorithm 512-entry BHT and a 64-entry Branch Target Instruction Cache (BTIC), which contains the most recently used branch target instructions, typically in pairs. When an instruction fetch does not hit in the BTIC the branch target address is calculated by adders.

•  The return address for subroutine calls is also calculated and stored in user-controlled special purpose registers.

•  The PowerPC 750 supports up to two branches, although instructions from the second predicted instruction stream can only be fetched but not dispatched.

Page 33: 9.1.0 Branch Prediction Pentiums IBM PPC

The Sun UltraSparc

•  Uses a dynamic single-level BHT Smith algorithm.

•  It employs a static prediction which is used to initialize the state machine (saturating up-down counters).

•  However, the UltraSparc maintains a large number of branch history entries (up to 2048, or every other line of the I-cache).

•  To predict branch target addresses, a branch-following mechanism is implemented in the instruction cache. The branch-following mechanism also allows several levels of speculative execution.

•  The overall claimed prediction accuracy of the UltraSparc is 94% for FP applications and 88% for integer applications.

Page 34: 9.1.0 Branch Prediction Pentiums IBM PPC


Pentium Architecture

Excerpted from: The Pentium: An Architectural History of the World's Most Famous Desktop Processor (Part I), by Jon Stokes, July 11, 2004

Page 35: 9.1.0 Branch Prediction Pentiums IBM PPC

General Features

Introduction date: March 22, 1993
Process: 0.8 micron
Transistor count: 3.1 million
Clock speed at introduction: 60 and 66 MHz
Cache sizes: L1: 8K instruction, 8K data
Features: MMX added in 1997

The Pentium's two-issue superscalar architecture was fairly straightforward. It had two five-stage integer pipelines, which Intel designated U and V, and one six-stage floating-point pipeline. The chip's front-end could do dynamic branch prediction (see Pentium 97 datasheet)


Page 37: 9.1.0 Branch Prediction Pentiums IBM PPC

Pipeline

The Pentium's basic integer pipeline is five stages long, with the stages broken down as follows:

1. Prefetch/Fetch: Instructions are fetched from the instruction cache and aligned in prefetch buffers for decoding.

2. Decode1: Instructions are decoded into the Pentium's internal instruction format. Branch prediction also takes place at this stage.

3. Decode2: Same as above, and microcode ROM kicks in here, if necessary. Also, address computations take place at this stage.

4. Execute: The integer hardware executes the instruction.

5. Write-back: The results of the computation are written back to the register file.

Page 38: 9.1.0 Branch Prediction Pentiums IBM PPC

Pipeline

The main difference between the Pentium's five-stage pipeline and the four-stage pipelines prevalent at the time lies in the second decode stage.

RISC ISAs support only simple addressing modes, but x86's multiple complex addressing modes, which were originally designed to make assembly language programmers' lives easier but ended up making everyone's lives more difficult, require extra address computations.

These computations are relegated to the second decode stage, where dedicated address computation hardware handles them before dispatching the instruction to the execution units

Page 39: 9.1.0 Branch Prediction Pentiums IBM PPC

x86 Legacy Support

•  A whopping 30% of the Pentium's transistors were dedicated solely to providing x86 legacy support.

•  The Pentium's entire front end was bloated and distended with hardware that was there solely to support x86 (mis)features which were rapidly falling out of use.

•  Today, x86 support accounts for well under 10% of the transistors on the Pentium 4, a drastic improvement over the original Pentium, and one that has contributed significantly to the ability of x86 hardware to catch up to and even surpass its RISC competitors in both integer and floating-point performance.

Page 40: 9.1.0 Branch Prediction Pentiums IBM PPC

Pentium Pipeline

Block Diagram of Pipeline operations

The Pentium's U and V integer pipes were not fully symmetric. U, as the default pipe, was slightly more capable and contained a shifter, which V lacked.

Floating-point, however, simply went from awful on the 486 to just mediocre with the Pentium: an improvement, to be sure, but not enough to make it even remotely competitive with comparable RISC chips on the market at that time.

Page 41: 9.1.0 Branch Prediction Pentiums IBM PPC


The Pentium Pro did manage to raise the x86 performance bar significantly. Its out-of-order execution engine, dual integer pipelines, and improved floating-point unit gave it enough oomph to get x86 into the commodity server market.

Pentium Architectural improvements – P6

Page 42: 9.1.0 Branch Prediction Pentiums IBM PPC

The P6 architecture evolution

                       Pentium Pro             Pentium II              Pentium III
Introduction date      November 1, 1995        May 7, 1997             February 26, 1999
Process                0.60/0.35 micron        0.35 micron             0.25 micron
Transistor count       5.5 million             7.5 million             9.5 million
Clock speed at intro   150/166/180/200 MHz     233/266/300 MHz         450/500 MHz
L1 cache size          8K instr, 8K data       16K instr, 16K data     16K instr, 16K data
L2 cache size          256K or 512K (on-die)   512K (off-die)          512K (on-die)
Features               No MMX                  MMX                     MMX, SSE, processor serial number

Page 43: 9.1.0 Branch Prediction Pentiums IBM PPC

Pentium Pro Architecture

Page 44: 9.1.0 Branch Prediction Pentiums IBM PPC

Pentium Pro Architecture

Page 45: 9.1.0 Branch Prediction Pentiums IBM PPC

Decoupling the front end from the back end

In the Pentium and its predecessors, instructions traveled directly from the decoding hardware to the execution hardware. As noted, the Pentium had some hardwired rules (see the next three slides) dictating which instructions could go to which execution units and in what combinations, so once the instructions were decoded, the rules took over and the dispatch logic shuffled them off to the proper execution unit.

The control unit is responsible for implementing and executing the rules that decide which instructions go where, and in what combinations.

This static, rules-based approach is rigid and simplistic, and it has two major drawbacks, both stemming from the fact that though the code stream is inherently sequential, a superscalar processor attempts to execute parts of it in parallel:

1. It adapts poorly to the dynamic and ever-changing code stream, and

2. It would make poor use of wider superscalar hardware.

Page 46: 9.1.0 Branch Prediction Pentiums IBM PPC

Pipeline Instruction Pairing Rules

•  Both instructions must be simple:
–  hardwired, no microcode support
–  must execute in 1 clock cycle

•  No data dependencies between the instructions (either memory or registers)

•  Neither instruction may contain both a displacement and an immediate value

•  Instructions with prefixes can only be issued in the U-pipe

•  Branches can only be the 2nd of a pair:
–  must execute in the V-pipe

Page 47: 9.1.0 Branch Prediction Pentiums IBM PPC

Pipeline Instruction Pairing Rules

Pseudocode:

IF I1 is simple
   AND I2 is simple
   AND I1 is not a jump
   AND dest. of I1 is not a source of I2
   AND dest. of I1 is not dest. of I2
THEN
   issue I1 to U-pipe
   issue I2 to V-pipe
ELSE
   issue I1 to U-pipe

Page 48: 9.1.0 Branch Prediction Pentiums IBM PPC

Efficiency of Instruction Pairing Rules

C code:

for (k = i + prime; k <= SIZE; k += prime)
    flags[k] = FALSE;

Compiler assembly:

; prime in ecx, k in edx, FALSE in al
Inner_loop:
    MOV byte ptr flags[edx], al
    ADD edx, ecx
    CMP edx, SIZE
    JLE Inner_loop

Execution cycles:                80486    Pentium
MOV byte ptr flags[edx], al        1      paired
ADD edx, ecx                       2        1
CMP edx, SIZE                      1      paired
JLE Inner_loop                     2        1
Total                              6        2

Page 49: 9.1.0 Branch Prediction Pentiums IBM PPC


Since the Pentium is a two-issue machine (i.e., it can issue at most two operations simultaneously from its decode hardware to its execution hardware on each clock cycle), its dispatch rules look at only two instructions at a time to see if they can or cannot be dispatched simultaneously. If more execution hardware were added, and the issue width were increased to three instructions per cycle (as it is in the P6), then the rules determining which instructions go where would need to be able to account for various possible combinations of two and three instructions at a time, in order to get those instructions to the right execution unit at the right time.

Furthermore, such rules would inevitably be difficult for coders to optimize for, and if they weren't to be overly complex then there would necessarily exist many common instruction sequences that would perform suboptimally under the default rule set.

The makeup of the code stream would change from application to application and from moment to moment, but the rules responsible for scheduling the code stream's execution would be forever fixed.

Dispatch Issues

Page 50: 9.1.0 Branch Prediction Pentiums IBM PPC


In-order:

1 - Instruction fetch.
2 - If input operands are available (in registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
3 - The instruction is executed by the appropriate functional unit.
4 - The functional unit writes the results back to the register file.

Out-of-order:

1 - Instruction fetch.
2 - Instruction dispatch to an instruction queue (also called an instruction buffer or reservation stations).
3 - The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
4 - The instruction is issued to the appropriate functional unit and executed by that unit.
5 - The results are queued.

Out-of-order processing

Page 51: 9.1.0 Branch Prediction Pentiums IBM PPC


Only after all older instructions have written their results back to the register file is this result written back to the register file. This is called the graduation or retire stage. The key concept of OoO processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoO processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.

Out-of-order processing

Page 52: 9.1.0 Branch Prediction Pentiums IBM PPC


OoO processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as program order; in the processor they are handled in data order, the order in which the data (operands) become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.

The benefit of OoO processing grows as the instruction pipeline deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.

Out-of-order processing

Page 53: 9.1.0 Branch Prediction Pentiums IBM PPC


The solution to the above dilemma is to place the newly decoded instructions in a buffer, and then issue them to the execution core whenever they're ready to be executed, even if that means executing them not just in parallel but in reverse order. This way, the current context in which a particular instruction finds itself executing can have much more of an impact on when and how it's executed. In replacing the control unit with a buffer, the P6 core replaces fixed rules with flexibility. The P6 architecture feeds each decoded instruction into a buffer called the reservation station (RS), where it waits until all of its execution requirements are met. Once they're met, the instruction then moves out of the reservation station into the proper execution unit, where it executes.

The reservation station

Page 54: 9.1.0 Branch Prediction Pentiums IBM PPC

The reorder buffer

After the instructions are decoded, they must travel through the reorder buffer (ROB) before flowing into the reservation station. The ROB is like a large log book in which the P6 can record all the essential information about each instruction that enters the execution core. The primary function of the ROB is to ensure that instructions come out one end of the out-of-order execution core in the same order in which they entered it.

So newly decoded instructions flow into the ROB, where their relevant information is logged in one of 40 available entries. From there, they pass on to the reservation station, and then on to the execution core. Once they're done executing, their results go back to the ROB, where they're stored until they're ready to be written back to the architectural registers. This final write-back, which is called retirement and which permanently alters the programmer-visible machine state, cannot happen until all of the instructions prior to the newly finished instruction have written back their results, a requirement which is necessary for maintaining the appearance of sequential execution.

Page 55: 9.1.0 Branch Prediction Pentiums IBM PPC

The instruction window

A common metaphor for thinking about and talking about the P6's RS + ROB combination, or analogous structures on other processors, is that of an instruction window. The P6's ROB can track up to 40 instructions in various stages of execution, and its reservation station can hold and examine up to 20 instructions to determine the optimal time for them to execute. You can think of the reservation station's 20-instruction buffer as a window that moves along the sequentially ordered code stream; on any given cycle, the P6 is looking through this window at that visible segment of the code stream and thinking about how its hardware can optimally execute the 20 or so instructions that it sees there.

Page 56: 9.1.0 Branch Prediction Pentiums IBM PPC

Register renaming

Register renaming does for the data stream what the instruction window does for the code stream: it allows the processor some flexibility in adapting its resources to fit the needs of the currently executing program. The x86 ISA has only eight general-purpose registers (GPRs) and eight floating-point registers (FPRs), a paltry number by today's standards (e.g., the PowerPC ISA specifies 32 of each register type), and a half to a quarter of what many of the P6's RISC contemporaries had. Register renaming allows a processor to have a larger number of actual registers than the ISA specifies, thereby enabling the chip to do more computations simultaneously without running out of registers.

Each of the P6 core's 40 ROB entries has a data field, which holds program data just like an x86 register. These fields give the P6's execution core 40 microarchitectural registers to work with, and they are used in combination with the P6's register allocation table (RAT) to implement register renaming in the P6 core.
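A toy sketch of the renaming idea (the real RAT/ROB interplay is more involved; sizes and names here are illustrative): the RAT points each architectural register at whichever ROB entry will produce its newest value:

    #define NUM_ARCH_REGS 8                   /* x86 GPRs: eax..edi           */
    #define ROB_ENTRIES   40                  /* P6-style ROB size            */

    /* arch reg -> ROB entry producing its newest value, or -1 if the
       committed value in the architectural register file is current. */
    static int rat[NUM_ARCH_REGS] = {-1, -1, -1, -1, -1, -1, -1, -1};
    static int rob_tail;                      /* next free ROB entry (allocation only) */

    /* Rename one instruction "dest = op(src)": read the source's current
       mapping, then point the destination at a freshly allocated ROB entry. */
    int rename(int dest_arch, int src_arch, int *src_rob_out) {
        *src_rob_out = rat[src_arch];         /* -1 means: read the architectural file */
        int new_entry = rob_tail;
        rob_tail = (rob_tail + 1) % ROB_ENTRIES;  /* real hardware also checks for a full ROB */
        rat[dest_arch] = new_entry;           /* later readers now name this ROB entry */
        return new_entry;
    }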

Page 57: 9.1.0 Branch Prediction Pentiums IBM PPC

The P6 execution core

The P6's execution core is significantly wider than that of the Pentium. Like the Pentium, it contains two symmetrical integer ALUs and a separate floating-point unit, but its load-store capabilities have been beefed up to include three execution units devoted solely to memory accesses: a load address unit, a store address unit, and a store data unit. The load address and store address units each contain a pair of four-input adders for calculating addresses and checking segment limits; these correspond to the adders found in the decode stage of the original Pentium. Up to five instructions per cycle can pass from the reservation station through the issue ports and into the execution units. This five issue-port structure is one of the most recognizable features of the P6 core, and when later designs (like the P-II) added execution units to the core (like MMX), they had to be added on one of the existing five issue ports.

Page 58: 9.1.0 Branch Prediction Pentiums IBM PPC

The P6 Pipeline

The P6 has a 12-stage pipeline, considerably longer than the Pentium's 5-stage pipeline.

BTB access and instruction fetch: The first three and a half pipeline stages are dedicated to accessing the branch target buffer and fetching the next instruction. The P6's two-cycle instruction fetch phase is longer than the Pentium's 1-cycle fetch, but it keeps the L1 cache access latency from holding back the clock speed of the processor as a whole.

Decode: The next two and a half stages are dedicated to decoding x86 instructions and breaking them down into the P6's internal, RISC-like instruction format.

Register rename: This stage takes care of register renaming and logging instructions in the ROB.

Write to RS: Moving instructions from the ROB to the reservation station (RS) takes one cycle, and occurs here.

Read from RS: It takes another cycle to move instructions out of the RS, through the issue ports, and into the execution units.

Execute: Instruction execution can take one cycle, as in the case of simple integer instructions, or multiple cycles, as in the case of floating-point instructions.

Retire: These two final cycles are dedicated to writing the results of the instruction execution back into the ROB, and then retiring the instructions by writing their results from the ROB into the architectural register file.

Page 59: 9.1.0 Branch Prediction Pentiums IBM PPC

The P6 Pipeline

Lengthening the P6's pipeline as described above has two primary beneficial effects. First, it allows Intel to crank up the processor's clock speed, since each of the stages is shorter, simpler, and can be completed quicker; but this is fairly common knowledge. The second effect is a little more subtle and less widely appreciated. The P6's longer pipeline, when combined with its buffered decoupling of fetch/decode bandwidth from issue bandwidth, allows the processor to hide hiccups in the fetch and decode stages. In short, the nine pipeline stages that lie ahead of the execute stage combine with the RS to form a deep buffer for instructions, and this buffer can hide gaps and hang-ups in the flow of instructions in much the same way that a large UPS can hide fluctuations and gaps in the flow of electricity to a device or a large water reservoir can hide interruptions in the flow of water to a facility.

Page 60: 9.1.0 Branch Prediction Pentiums IBM PPC

General Branch Prediction

In computer architecture, a branch predictor is the part of a processor that determines whether a conditional branch in the instruction flow of a program is likely to be taken or not. Branch predictors are crucial in today's modern, superscalar processors for achieving high performance. They allow processors to fetch and execute instructions without waiting for a branch to be resolved. Almost all pipelined processors do branch prediction of some form, because they must guess the address of the next instruction to fetch before the current instruction has been executed. Many earlier microprogrammed CPUs did not do branch prediction because there was little or no performance penalty for altering the flow of the instruction stream. Branch prediction is not the same as branch target prediction. Branch prediction attempts to guess whether a conditional branch will be taken or not. Branch target prediction attempts to guess the target of the branch or unconditional jump before it is computed by parsing the instruction itself.*

* Wikipedia definition

Page 61: 9.1.0 Branch Prediction Pentiums IBM PPC

Branch prediction on the P6

The P6 expended considerably more resources than its predecessor on branch prediction, and managed to boost dynamic branch prediction accuracy from the Pentium's ~75% rate to upwards of 90%. As we'll see when we look at the P4, branch prediction gets more important as pipelines get longer, because a pipeline flush due to a mispredict means more lost cycles. Consider the case of a conditional branch whose outcome depends on the result of an integer calculation. On the original Pentium, the calculation happens in the fourth pipeline stage, and if the branch prediction unit (BPU) has guessed wrongly then only three cycles worth of work would be lost in the pipeline flush. On the P6, though, the conditional calculation isn't performed until stage 10, which means 9 cycles worth of work gets flushed if the BPU guesses wrongly.


Page 62: 9.1.0 Branch Prediction Pentiums IBM PPC


The Pentium is a fifth-generation x86 architecture microprocessor from Intel, developed by Vinod Dham. It was the successor to the 486 line. The Pentium was expected to be named 80586 or i586, to follow the naming convention of previous generations. However, Intel was unable to convince a court to allow them to trademark a number (such as 486), in order to prevent competitors such as Advanced Micro Devices from branding their processors with similar names (such as AMD's Am486). Intel enlisted the help of Lexicon Branding to create a brand that could be trademarked. The Pentium brand was very successful, and was maintained through several generations of processors, from the Pentium Pro to the Pentium Extreme Edition. Although not used for marketing purposes, Pentium series processors are still given numerical product codes, starting with 80500 for the original Pentium chip.

Additional Pentium Facts – from Wikipedia

Page 63: 9.1.0 Branch Prediction Pentiums IBM PPC


P5, P54C, P54CS

The original Pentium microprocessor had the internal code name P5 and the product code 80501 (80500 for the earliest steppings). This was a pipelined in-order superscalar microprocessor, produced using a 0.8 µm process. It was followed by the P54C (80502), a shrink of the P5 to a 0.6 µm process, which was dual-processor ready and had an internal clock speed different from the front-side bus (it's much more difficult to increase the bus speed than to increase the internal clock). In turn, the P54C was followed by the P54CS, which used a 0.35 µm process: a pure CMOS process, as opposed to the Bipolar CMOS process that was used for the earlier Pentiums.

Additional Pentium Facts - from Wikipedia

Page 64: 9.1.0 Branch Prediction Pentiums IBM PPC


The early versions of 60-100 MHz Pentiums had a problem in the floating point unit that, in rare cases, resulted in reduced precision of division operations. This bug, discovered in Lynchburg, Virginia in 1994, became known as the Pentium FDIV bug (see article) and caused great embarrassment for Intel, which created an exchange program to replace the faulty processors with corrected ones.

Additional Pentium Facts - from Wikipedia

Page 65: 9.1.0 Branch Prediction Pentiums IBM PPC

Pentium FDIV bug details

The Pentium FDIV bug is the most famous (or infamous) of the Intel microprocessor bugs. It was caused by an error in a lookup table that was part of Intel's SRT algorithm, which was meant to be faster and more accurate. With a goal of boosting the execution of floating-point scalar code by 3 times and vector code by 5 times compared to the 486DX chip, Intel decided to use the SRT algorithm, which can generate two quotient bits per clock cycle, while the traditional 486 shift-and-subtract algorithm was generating only one quotient bit per cycle. This SRT algorithm uses a lookup table to calculate the intermediate quotients necessary for floating-point division. Intel's lookup table consists of 1066 table entries, of which, due to a programming error, five were not downloaded into the programmable logic array (PLA). When any of these five cells is accessed by the floating point unit (FPU), the FPU fetches zero instead of +2, which was supposed to be contained in the "missing" cells. This throws off the calculation and results in a less precise number than the correct answer (Byte Magazine, March 1995).

At its worst, this error can occur as high as the fourth significant digit of a decimal number, but the probability of this happening is 1 in 360 billion. Most commonly the error appears in the 9th or 10th decimal digit, with a probability of 1 in 9 billion.

Intel classified the bug (or the flaw, as they refer to it) with the following characteristics:
• On certain input data, the FPDI (Floating Point Divide Instructions) on the Pentium processor produce inaccurate results.
• The error can occur in any of the three operating precisions (single, double, or extended) for the divide instruction. However, far fewer failures are found in single precision than in double or extended precision.
• The incidence of the problem is independent of the processor rounding modes.
• The occurrence of the problem is highly dependent on the input data. Only certain data will trigger the problem. There is a probability of 1 in 9 billion that randomly fed divide or remainder instructions will produce inaccurate results.
• The degree of inaccuracy depends on the input data and upon the instruction involved.
• The problem does not occur on the specific use of the divide instruction to compute the reciprocal of the input operand in single precision.

Furthermore, the bug affects any instruction that references the lookup table or calls FDIV. Related instructions that are affected by the bug are FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR, FPREM, and FPREM1. The instructions FPTAN and FPATAN are also susceptible. The instructions FYL2X, FYL2XP1, FSIN, FCOS, and FSINCOS were suspect but are now considered safe.

Page 66: 9.1.0 Branch Prediction Pentiums IBM PPC


A 3-D plot of the ratio 4195835/3145727 calculated on a Pentium with the FDIV bug. The depressed triangular areas indicate where incorrect values have been computed. The correct values would all round to 1.3338, but the returned values are 1.3337, an error in the fifth significant digit (Byte Magazine, March 1995).

Intel adopted a no-questions-asked replacement policy for customers affected by the Pentium FDIV bug. It also did statistical research and provided information on the bug at its site at intel.com.

Pentium FDIV bug details

Page 67: 9.1.0 Branch Prediction Pentiums IBM PPC


The 60 and 66 MHz 0.8 µm versions of the Pentium processors were also known for their fragility and their (for the time) high levels of heat production; in fact, the Pentium 60 and 66 were often nicknamed "coffee warmers". They were also known as "high-voltage Pentiums", due to their 5V operation. The heat problems were removed with the P54C, which ran at a much lower voltage (3.3V). P5 Pentiums used Socket 4, while the P54C started out on Socket 5 before moving to Socket 7 in later revisions. All desktop Pentiums from the P54CS onwards used Socket 7.

Another bug, known as the f00f bug, was discovered soon afterwards, but fortunately operating system vendors responded by implementing workarounds that prevented the crash.

Additional Pentium Facts - from Wikipedia

Page 68: 9.1.0 Branch Prediction Pentiums IBM PPC

Pentium (FDIV) Jokes

Q: What is Intel's follow-on to the Pentium?
A: Repentium.

The Pentium doesn't have bugs or produce errors; it's just Precision-Impaired.

Q: How many Pentium designers does it take to screw in a light bulb?
A: 1.99904274017, but that's close enough for non-technical people.

Q: What's another name for the "Intel Inside" sticker they put on Pentiums?
A: The warning label.

Q: What do you call a series of FDIV instructions on a Pentium?
A1: Successive approximations.
A2: A random number generator.

Q: Why didn't Intel call the Pentium the 586?
A: Because they added 486 and 100 on the first Pentium and got 585.999983605.


Page 72: 9.1.0 Branch Prediction Pentiums IBM PPC

Pentium Pro

The Pentium Pro micro-architecture is a three-way superscalar, pipelined architecture. The three-way superscalar architecture is capable of decoding, dispatching, and retiring three instructions per clock cycle. The Pentium Pro processor family utilizes a decoupled 14-stage superpipeline that supports out-of-order instruction execution to facilitate the high level of instruction throughput. The Pentium Pro micro-architecture pipeline is divided into four sections: the 1st-level and 2nd-level caches, the front end, the out-of-order execution core, and the retire section. The sections of the pipeline are supplied with instructions and data through the bus interface unit.

The Pentium Pro processor micro-architecture utilizes two cache levels to provide a steady stream of instructions and data to the instruction execution pipeline. The L1 cache provides an 8-Kbyte instruction cache and an 8-Kbyte data cache, both closely coupled to the pipeline. The L2 cache is a 256-Kbyte, 512-Kbyte, 1-Mbyte, or 2-Mbyte static RAM that is coupled to the core processor through a full clock-speed 64-bit cache bus. The four-way set-associative L2 cache employs 32-byte cache lines and contains 8 bits of error-correcting code for each 64 bits of data. The nonblocking L1 and L2 caches permit multiple cache misses to proceed in parallel; cache hits proceed during outstanding cache misses to other addresses.

Page 73: 9.1.0 Branch Prediction Pentiums IBM PPC

Steppings


Page 79: 9.1.0 Branch Prediction Pentiums IBM PPC


NetBurst is the name Intel gave to the new architecture that succeeded its P6 microarchitecture. The concept behind NetBurst was to improve the throughput, improve the efficiency of the out-of-order execution engine, and to create a processor that can reach much higher frequencies with higher performance relative to the P5 and P6 microarchitectures, while maintaining backward compatibility. Initially launched in Intel’s seventh-generation Pentium 4 processors (the Willamette core) in late 2000, the NetBurst architecture represented the biggest change to the IA-32 architecture since the Pentium Pro in 1995. One of the most important changes was to the processor’s internal pipeline, referred to as Hyper Pipeline. This comprised 20 pipeline stages versus the ten for the P6 microarchitecture and was instrumental in allowing the processor to process more instructions per clock and to operate at significantly higher clock speeds than its predecessor.

Page 80: 9.1.0 Branch Prediction Pentiums IBM PPC


The NetBurst microarchitecture has only one decoder (as opposed to the three in the P6 microarchitecture), and the out-of-order execution unit now has an execution trace cache that stores decoded µops. The core's ability to execute instructions out of order remains a key factor in enabling parallelism, with several buffers employed to smooth the flow of µops; the longer pipelines and the improved out-of-order execution engine allow the processor to achieve higher frequencies and improve throughput.

Ultimately, the NetBurst microarchitecture was to prove to be something of a disappointment in comparison to Intel's mobile-processor technology. It was therefore not entirely surprising when it transpired that NetBurst's successor would build on the energy-efficient philosophy adopted in Intel's mobile microarchitecture and embodied in its Pentium M family of processors.


Page 85: 9.1.0 Branch Prediction Pentiums IBM PPC

A Detailed Look Inside the Intel® NetBurst™ Micro-Architecture of the Intel Pentium® 4 Processor

Intel® NetBurst™ Micro-architecture

The Pentium® 4 processor is the first hardware implementation of a new micro-architecture, the Intel NetBurst micro-architecture. To help the reader understand this new micro-architecture, this section examines in detail the following:

• the design considerations of the Intel NetBurst micro-architecture

• the building blocks that make up this new micro-architecture

• the operation of key functional units of this micro-architecture, based on the implementation in the Pentium 4 processor.

The Intel NetBurst micro-architecture is designed to achieve high performance for both integer and floating-point computations at very high clock rates. It has the following features:

• hyper-pipelined technology to enable high clock rates and frequency headroom well above 1 GHz

• rapid execution engine to reduce the latency of basic integer instructions

• high-performance, quad-pumped bus interface to the 400 MHz Intel NetBurst micro-architecture system bus

• execution trace cache to shorten branch delays

• cache line sizes of 64 and 128 bytes

• hardware prefetch

• aggressive branch prediction to minimize pipeline delays

• out-of-order speculative execution to enable parallelism

• superscalar issue to enable parallelism

• hardware register renaming to avoid register name space limitations

The Design Considerations of the Intel® NetBurst™ Micro-architecture

The design goals of the Intel NetBurst micro-architecture are: (a) to execute both legacy IA-32 code and applications based on single-instruction, multiple-data (SIMD) technology at high processing rates; (b) to operate at high clock rates, and to scale to higher performance and clock rates in the future. To accomplish these design goals, the Intel NetBurst micro-architecture has many advanced features and improvements over the Pentium Pro processor micro-architecture.

The major design considerations of the Intel NetBurst micro-architecture to enable high performance and highly scalable clock rates are as follows:

• It uses a deeply pipelined design to enable high clock rates, with different parts of the chip running at different clock rates, some faster and some slower than the nominally quoted clock frequency of the processor. The Intel NetBurst micro-architecture allows the Pentium 4 processor to achieve significantly higher clock rates than the Pentium III processor, well above 1 GHz.

• Its pipeline provides high performance by optimizing for the common case of frequently executed instructions. The most frequently executed instructions in common circumstances (such as a cache hit) are decoded efficiently and executed with short latencies, so that frequently encountered code sequences are processed with high throughput.

• It employs many techniques to hide stall penalties, among them parallel execution, buffering, and speculation. Furthermore, the Intel NetBurst micro-architecture executes instructions dynamically and out of order, so the time it takes to execute each individual instruction is not always deterministic. Performance of a particular code sequence may vary depending on the state the machine was in when that code sequence was entered.

Page 86: 9.1.0 Branch Prediction Pentiums IBM PPC

86

ECE – 684

A Detailed Look Inside the Intel® NetBurst™ Micro-Architecture of the Intel Pentium® 4 Processor

Page 10

Overview of the Intel® NetBurst™ Micro-architecture Pipeline

The pipeline of the Intel NetBurst micro-architecture contains three sections:

• the in-order issue front end

• the out-of-order superscalar execution core

• the in-order retirement unit.

The front end supplies instructions in program order to the out-of-order core. It fetches and decodes IA-32 instructions. The decoded IA-32 instructions are translated into micro-operations (µops). The front end's primary job is to feed a continuous stream of µops to the execution core in original program order.

The core can then issue multiple µops per cycle and aggressively reorder µops, so that µops whose inputs are ready and which have execution resources available can execute as soon as possible. The retirement section ensures that the results of execution of the µops are processed according to original program order and that the proper architectural states are updated.

Figure 3 illustrates a block diagram view of the major functional blocks associated with the Intel NetBurst micro-architecture pipeline. The paragraphs that follow Figure 3 provide an overview of each of the three sections in the pipeline.

The Front End

The front end of the Intel NetBurst micro-architecture consists of two parts:

• the fetch/decode unit

• the execution trace cache.

The front end performs several basic functions:

• prefetches IA-32 instructions that are likely to be executed

• fetches instructions that have not already been prefetched

• decodes instructions into µops

• generates microcode for complex instructions and special-purpose code

• delivers decoded instructions from the execution trace cache

• predicts branches using a highly advanced algorithm.

The front end of the Intel NetBurst micro-architecture is designed to address some of the common problems in high-speed, pipelined microprocessors. Two of these problems contribute major sources of delay:

• the time to decode instructions fetched from the target

• wasted decode bandwidth due to branches or branch targets in the middle of cache lines.

The execution trace cache addresses both of these problems by storing decoded IA-32 instructions. Instructions are fetched and decoded by a translation engine. The translation engine builds the decoded instructions into sequences of µops called traces, which are stored in the execution trace cache.

[Figure 3: The Intel® NetBurst™ Micro-architecture. The block diagram shows the front end (fetch/decode unit, trace cache, microcode ROM, and BTBs/branch prediction with a branch history update path), the out-of-order execution core, and retirement, backed by a 4-way 1st-level cache, an 8-way 2nd-level cache, an optional 3rd-level cache (server products only), and the bus unit connected to the system bus. Frequently used and less frequently used paths are distinguished.]

Page 87: 9.1.0 Branch Prediction Pentiums IBM PPC

87

ECE – 684

A Detailed Look Inside the Intel® NetBurst™ Micro-Architecture of the Intel Pentium® 4 Processor

Page 12

Prefetching

The Intel NetBurst micro-architecture supports three prefetching mechanisms:

• the first is for instructions only

• the second is for data only

• the third is for code or data.

The first mechanism is a hardware instruction fetcher that automatically prefetches instructions. The second is a software-controlled mechanism that fetches data into the caches using the prefetch instructions. The third is a hardware mechanism that automatically fetches data and instructions into the unified second-level cache.

The hardware instruction fetcher reads instructions along the path predicted by the BTB into the instruction streaming buffers. Data is read in 32-byte chunks starting at the target address. The second and third mechanisms are described in Data Prefetch.
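The second, software-controlled mechanism is reachable from C through the SSE prefetch intrinsic. Below is a minimal sketch assuming a simple streaming loop; the function name and the prefetch distance of 16 elements are illustrative assumptions, not from the text above.

    /* Sketch of software-controlled prefetch via the SSE intrinsic.
       Compile with SSE support; the distance is a hypothetical tuning choice. */
    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    #define PREFETCH_DISTANCE 16

    float sum_with_prefetch(const float *data, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            /* Ask the hardware to pull a future cache line closer. */
            if (i + PREFETCH_DISTANCE < n)
                _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE],
                             _MM_HINT_T0);
            sum += data[i];
        }
        return sum;
    }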

Decoder

The front end of the Intel NetBurst micro-architecture has a single decoder that can decode instructions at the maximum rate of one instruction per clock. Complex instructions must enlist the help of the microcode ROM. The decoder's operation is connected to the execution trace cache, discussed in the section that follows.

Execution Trace Cache

The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst micro-architecture. The TC stores decoded IA-32 instructions, or µops. This removes decoding costs on frequently executed code, such as template restrictions and the extra latency to decode instructions upon a branch misprediction.

In the Pentium 4 processor implementation, the TC can hold up to 12K µops and can deliver up to three µops per cycle. The TC does not hold all of the µops that need to be executed in the execution core. In some situations, the execution core may need to execute a microcode flow instead of the µop traces that are stored in the trace cache.

The Pentium 4 processor is optimized so that the most frequently executed IA-32 instructions come from the trace cache, efficiently and continuously, while only a few instructions involve the microcode ROM.

Branch Prediction

Branch prediction is very important to the performance of a deeply pipelined processor. It enables the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty incurred in the absence of a correct prediction. For the Pentium 4 processor, the branch delay for a correctly predicted instruction can be as few as zero clock cycles. The branch delay for a mispredicted branch can be many cycles, typically equivalent to the depth of the pipeline.

The branch prediction in the Intel NetBurst micro-architecture predicts all near branches, including conditional branches, unconditional calls and returns, and indirect branches. It does not predict far transfers, for example, far calls, irets, and software interrupts.

In addition, several mechanisms are implemented to aid in predicting branches more accurately and in reducing the cost of taken branches:

• dynamically predict the direction and target of branches based on the instruction's linear address, using the branch target buffer (BTB)

• if no dynamic prediction is available or if it is invalid, statically predict the outcome based on the offset of the target: a backward branch is predicted taken, a forward branch is predicted not taken

• return addresses are predicted using the 16-entry return address stack

• traces of instructions are built across predicted taken branches to avoid branch penalties.

Page 88: 9.1.0 Branch Prediction Pentiums IBM PPC

88

ECE – 684 A Detailed Look Inside the Intel® NetBurst™ Micro-Architecture of the Intel Pentium® 4 Processor

Page 13

The Static Predictor. Once the branch instruction is decoded, the direction of the branch (forward or backward) is known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the direction of the branch. The static prediction mechanism predicts backward conditional branches (those with negative displacement), such as loop-closing branches, as taken. Forward branches are predicted not taken.
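As a minimal sketch, the static rule above can be expressed as a function of the branch's signed displacement; the names and types here are illustrative assumptions, not Intel's implementation.

    /* Sketch of the static prediction rule: backward (negative displacement)
       branches predicted taken, forward branches predicted not taken. */
    #include <stdint.h>

    typedef enum { NOT_TAKEN = 0, TAKEN = 1 } prediction_t;

    prediction_t static_predict(int32_t displacement)
    {
        return (displacement < 0) ? TAKEN : NOT_TAKEN;
    }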

Branch Target Buffer. Once branch history is available, the Pentium 4 processor can predict the branch outcome before the branch instruction is even decoded, based on a history of previously encountered branches. It uses a branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of branches based on an instruction's linear address. Once the branch is retired, the BTB is updated with the target address.

Return Stack. Returns are always taken, but since a procedure may be invoked from several call sites, a single predicted target will not suffice. The Pentium 4 processor has a Return Stack that can predict return addresses for a series of procedure calls. This increases the benefit of unrolling loops containing function calls. It also mitigates the need to put certain procedures inline, since the return penalty portion of the procedure call overhead is reduced.
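A return address stack can be sketched as a small circular buffer: a call pushes its fall-through address, and a return pops the predicted target. The 16-entry depth matches the text above; everything else is an illustrative assumption.

    /* Sketch of a 16-entry return address stack (RAS). */
    #include <stdint.h>

    #define RAS_ENTRIES 16

    static uint32_t ras[RAS_ENTRIES];
    static int      ras_top = 0;

    void ras_push_call(uint32_t fall_through)       /* on a predicted call */
    {
        ras[ras_top] = fall_through;
        ras_top = (ras_top + 1) % RAS_ENTRIES;      /* oldest entry wraps */
    }

    uint32_t ras_predict_return(void)               /* on a predicted return */
    {
        ras_top = (ras_top - 1 + RAS_ENTRIES) % RAS_ENTRIES;
        return ras[ras_top];
    }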

Even if the direction and target address of the branch are correctly predicted well in advance, a taken branch may reduce available parallelism in a typical processor, since decode bandwidth is wasted on instructions which immediately follow the branch and precede the target, if the branch does not end the line and the target does not begin the line. The branch predictor allows a branch and its target to coexist in a single trace cache line, maximizing instruction delivery from the front end.

Branch Hints

The Pentium 4 processor provides a feature that permits software to provide hints to the branch prediction and trace formation hardware to enhance their performance. These hints take the form of prefixes to conditional branch instructions. These prefixes have no effect on pre-Pentium 4 processor implementations. Branch hints are not guaranteed to have any effect, and their function may vary across implementations. However, since branch hints are architecturally visible and the same code could be run on multiple implementations, they should be inserted only in cases which are likely to be helpful across all implementations.

Branch hints are interpreted by the translation engine and are used to assist the branch prediction and trace construction hardware. They are used only at trace build time and have no effect within already-built traces. Directional hints override the static (backward-taken, forward-not-taken) prediction in the event that a BTB prediction is not available. Because branch hints increase code size slightly, the preferred approach to providing directional hints is by the arrangement of code (as sketched below) so that

(i) forward branches that are more probable are in the not-taken path, and

(ii) backward branches that are more probable are in the taken path. Since the branch prediction information that is available when the trace is built is used to predict which path or trace through the code will be taken, directional branch hints can help traces be built along the most likely path.
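A minimal sketch of that arrangement in C, with hypothetical helper names: the rare error check is a forward, normally not-taken branch, and the loop-closing branch is backward and normally taken, so both match the static rule without any hint prefixes.

    #include <stddef.h>

    extern int  rare_error_condition(int value);  /* hypothetical predicate   */
    extern void handle_error(size_t index);       /* hypothetical cold path   */
    extern void process(int value);               /* hypothetical common path */

    void scan(const int *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++) {          /* backward branch: taken     */
            if (rare_error_condition(buf[i]))     /* forward branch: not taken  */
                handle_error(i);                  /* cold code, not-taken path  */
            process(buf[i]);
        }
    }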

Execution Core Detail

The execution core is designed to optimize overall performance by handling the most common cases most efficiently. The hardware is designed to execute the most frequent operations in the most common context as fast as possible, at the expense of less frequent operations in rare contexts. Some parts of the core may speculate that a common condition holds to allow faster execution; if it does not, the machine may stall. An example of this pertains to store forwarding: if a load is predicted to be dependent on a store, it gets its data from that store and tentatively proceeds. If the load turns out not to depend on the store, the load is delayed until the real data has been loaded from memory, and then it proceeds.
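The store-forwarding case can be seen in a pattern like the following sketch (the names are illustrative): the load of tmp is predicted to depend on the store just before it, so the core forwards the stored value and proceeds tentatively.

    void scale_and_copy(int *a, int i, int k)
    {
        a[i]     = a[i] * k;   /* store to a[i]                              */
        int tmp  = a[i];       /* dependent load: value forwarded from store */
        a[i + 1] = tmp + 1;    /* proceeds with the forwarded value          */
    }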

Instruction Latency and Throughput

The superscalar, out-of-order core contains multiple execution hardware resources that can execute multiple µops in parallel. The core's ability to make use of available parallelism can be enhanced by:


Page 90: 9.1.0 Branch Prediction Pentiums IBM PPC

90

ECE – 684 IA- 32 Architecture

Richard Eckert Anthony Marino Matt Morrison Steve Sonntag

Page 91: 9.1.0 Branch Prediction Pentiums IBM PPC

91

ECE – 684 IA-32 Overview

•  IA-32 Overview –  Pentium 4 / Netburst µArchitecture –  SSE2

•  Hyper Pipeline –  Overview –  Branch Prediction

•  Execution Types –  Rapid Execution Engine –  Advanced Dynamic Execution

•  Memory Management –  Segmentation –  Paging –  Virtual Memory

•  Address Modes / Instruction Format –  Address Translation

•  Cache –  Levels of Cache (L1 & L2) / Execution Trace Cache –  Instruction Decoder –  System Bus

•  Register Files –  Enhanced Floating Point & Multi-Media Unit

•  Summary / Conclusion

Page 92: 9.1.0 Branch Prediction Pentiums IBM PPC

92

ECE – 684 IA-32 Background

•  Traced to 1969 –  Intel 4004

•  P4 –  1st IA-32 processor based on the Intel NetBurst microarchitecture

•  Netburst –  Allows
   •  Higher performance levels
   •  Performance at higher clock speeds

•  Compatible with existing applications and operating systems
   –  Written to run on Intel IA-32 architecture processors

Page 93: 9.1.0 Branch Prediction Pentiums IBM PPC

93

ECE – 684 1st Implementation of Intel Netburst µArchitecture

•  Rapid Execution Engine
•  Hyper Pipelined Technology
•  Advanced Dynamic Execution
•  Innovative Cache Subsystem
•  Streaming SIMD Extensions 2 (SSE2)
•  400 MHz System Bus

Page 94: 9.1.0 Branch Prediction Pentiums IBM PPC

94

ECE – 684 Netburst µArchitecture

Page 95: 9.1.0 Branch Prediction Pentiums IBM PPC

95

ECE – 684 SSE2

•  Internet Streaming SIMD Extensions 2 (SSE2) – What is it?

– What does it do?

– How is this helpful?

Page 96: 9.1.0 Branch Prediction Pentiums IBM PPC

96

ECE – 684 IA-32 Overview

•  IA-32 Overview –  Pentium 4 / Netburst µArchitecture –  SSE2

•  Hyper Pipeline –  Overview –  Branch Prediction

•  Execution Types –  Rapid Execution Engine –  Advanced Dynamic Execution

•  Memory Management –  Segmentation –  Paging –  Virtual Memory

•  Address Modes / Instruction Format –  Address Translation

•  Cache –  Levels of Cache (L1 & L2) / Execution Trace Cache –  Instruction Decoder –  System Bus

•  Register Files –  Enhanced Floating Point & Multi-Media Unit

•  Summary / Conclusion

Page 97: 9.1.0 Branch Prediction Pentiums IBM PPC

97

ECE – 684 Hyper Pipelined

•  What is hyper pipeline technology? –  Deeper pipeline –  Fewer gates per pipeline stage

•  What are the benefits of hyper pipeline? –  Increased clock rate –  Increased performance

Page 98: 9.1.0 Branch Prediction Pentiums IBM PPC

98

ECE – 684 Netburst™ vs. P6

Typical P6 pipeline (10 stages):
1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec

Typical Pentium 4 pipeline (20 stages):
1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 BrCk | 20 Drive

Page 99: 9.1.0 Branch Prediction Pentiums IBM PPC

99

ECE – 684

[Block diagram: the Intel NetBurst micro-architecture. A 3.2 GB/s system interface feeds the L2 cache and control. The front end (BTB & I-TLB, decoder, trace cache with its own BTB, µcode ROM) feeds rename/alloc, the µop queues, and the schedulers; the integer and FP register files feed the execution units (load and store AGUs, four ALUs, FP move, FP store, Fmul, Fadd, MMX, SSE), backed by the L1 data cache and D-TLB. The 20 pipeline stages are annotated as listed on the previous slide.]

Page 100: 9.1.0 Branch Prediction Pentiums IBM PPC

100

ECE – 684 Netburst µArchitecture

Page 101: 9.1.0 Branch Prediction Pentiums IBM PPC

101

ECE – 684 Branch Prediction

•  Centerpiece of dynamic execution –  Delivers high performance in a pipelined µ-architecture

•  Allows continuous fetching and execution –  Predicts the next instruction address

•  A branch is predictable if its pattern repeats within 4 or fewer iterations

Branch prediction decreases the number of instructions that would otherwise be flushed from the pipeline

Page 102: 9.1.0 Branch Prediction Pentiums IBM PPC

102

ECE – 684 Examples

Predictable:

    if (a == 5)
        a = 7;
    else
        a = 5;

Not predictable:

    L1: lpcnt++;
        if ((lpcnt % 5) == 0)
            printf("Loop count is divisible by 5\n");

Page 103: 9.1.0 Branch Prediction Pentiums IBM PPC

103

ECE – 684 IA-32 Overview

•  IA-32 Overview –  Pentium 4 / Netburst µArchitecture –  SSE2

•  Hyper Pipeline –  Overview –  Branch Prediction

•  Execution Types –  Rapid Execution Engine –  Advanced Dynamic Execution

•  Memory Management –  Segmentation –  Paging –  Virtual Memory

•  Address Modes / Instruction Format –  Address Translation

•  Cache –  Levels of Cache (L1 & L2) / Execution Trace Cache –  Instruction Decoder –  System Bus

•  Register Files –  Enhanced Floating Point & Multi-Media Unit

•  Summary / Conclusion

Page 104: 9.1.0 Branch Prediction Pentiums IBM PPC

104

ECE – 684 Rapid Execution Engine

•  Contains 2 ALUs –  Run at twice the core processor frequency

•  Allows basic integer instructions to execute in ½ a clock cycle

•  Up to 126 instructions, 48 loads, and 24 stores can be in flight at the same time

•  Example –  The Rapid Execution Engine on a 1.50 GHz P4 processor runs

at _________ Hz? (2 x 1.50 GHz = 3.0 GHz)

Page 105: 9.1.0 Branch Prediction Pentiums IBM PPC

105

ECE – 684

[Diagram: the out-of-order execution logic and retirement logic, with branch history updates fed back to the branch predictors.]

Page 106: 9.1.0 Branch Prediction Pentiums IBM PPC

106

ECE – 684 Advanced Dynamic Execution

•  Out-of-Order Engine – Reorders instructions – Executes as input operands become ready – Keeps the ALUs busy

•  Reports branch history information •  Increases overall speed

Page 107: 9.1.0 Branch Prediction Pentiums IBM PPC

107

ECE – 684 IA-32 Overview

•  IA-32 Overview –  Pentium 4 / Netburst µArchitecture –  SSE2

•  Hyper Pipeline –  Overview –  Branch Prediction

•  Execution Types –  Rapid Execution Engine –  Advanced Dynamic Execution

•  Memory Management –  Paging –  Virtual Memory –  Segmentation

•  Address Modes / Instruction Format –  Address Translation

•  Cache –  Levels of Cache (L1 & L2) / Execution Trace Cache –  Instruction Decoder –  System Bus

•  Register Files –  Enhanced Floating Point & Multi-Media Unit

•  Summary / Conclusion

Page 108: 9.1.0 Branch Prediction Pentiums IBM PPC

108

ECE – 684

Memory Management

•  Management Facilities divided into two parts:

Segmentation - isolates individual processes so that multiple programs can run on the same processor without interfering with each other.

Demand Paging - provides a mechanism for implementing a virtual memory that is much larger than the actual physical memory, so that memory appears seemingly infinite.

Page 109: 9.1.0 Branch Prediction Pentiums IBM PPC

109

ECE – 684 Memory Management

[Diagram: address translation example (cf. Comp. Arch. I). A logical (virtual) address is translated by segmentation & paging into the physical address used to access IA-32 memory; the instruction decoder produces instruction addresses and instruction control words that drive the memory access.]

Page 110: 9.1.0 Branch Prediction Pentiums IBM PPC

110

ECE – 684

Modes of Operation

•  Protected mode - Native operating mode of the processor. All features available, providing highest performance and capability.

- Must use segmentation, paging optional.

•  Real-address mode - 8086 processor programming environment

•  System management mode (SMM) - Standard arch. feature in all later IA-32 processors. Power management, OEM differentiation features

• Virtual-8086 mode - used while in protected mode, allows processor to execute 8086 software in a protected, multitasked environment.

The concentration here is on protected mode; the other modes are listed for context.

Page 111: 9.1.0 Branch Prediction Pentiums IBM PPC

111

ECE – 684

Paging

•  Subdivide memory into small fixed-size "chunks" called frames or page frames

•  Divide programs into same-sized chunks, called pages

•  Loading a program into memory requires the allocation of the required number of pages

•  Limits wasted memory to a fraction of the last page

•  Page frames used in the loading process need not be contiguous

- Each program has a page table associated with it that maps each program page to a memory page frame

Page 112: 9.1.0 Branch Prediction Pentiums IBM PPC

112

ECE – 684

[Diagram: IA-32 two-level paging. A logical address is converted by segmentation into a linear address split into Dir | Page | Offset fields; the Dir field indexes the page directory, the Page field indexes a page table, and the Offset selects the byte within the page frame in main memory, yielding the physical address.]
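A minimal sketch of the Dir | Page | Offset split for 4-Kbyte pages (10 + 10 + 12 bits); the example address is arbitrary.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t linear = 0x12345678u;              /* example linear address */
        uint32_t dir    = (linear >> 22) & 0x3FFu;  /* bits 31..22: directory */
        uint32_t page   = (linear >> 12) & 0x3FFu;  /* bits 21..12: table     */
        uint32_t offset =  linear        & 0xFFFu;  /* bits 11..0:  offset    */
        printf("dir=%u page=%u offset=0x%03X\n",
               (unsigned)dir, (unsigned)page, (unsigned)offset);
        return 0;
    }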

Virtual Memory:

•  Only program pages required for execution of the program are actually loaded

•  Only a few pages of any one program might be in memory at a time

•  Possible to run program consisting of more pages than can fit in memory

"Demand" Paging

Page 113: 9.1.0 Branch Prediction Pentiums IBM PPC

113

ECE – 684

Segmentation

•  Programmer subdivides the program into logical units called segments

- Programs subdivided by function

- Data array items grouped together as a unit

•  Paging - invisible to the programmer; Segmentation - usually visible to the programmer

- A convenience for organizing programs and data, and a means for associating access and usage rights with instructions and data

- Sharing: a segment can be addressed by other processes, e.g. a table of data

- Dynamic size: a growing data structure

Page 114: 9.1.0 Branch Prediction Pentiums IBM PPC

114

ECE – 684

Address Translation

[Diagram: full address translation. A logical address (segment selector + offset) is translated via the segment table into a linear address; the linear address (Dir | Page | Offset) is then translated by the page directory and page table into the physical address in main memory. The segment selector is laid out as Index | TI | RPL:]

Index: the number of the segment; serves as an index into the segment table.

TI (one bit): table indicator; selects either the global or the local segment table for translation.

RPL (two bits): requested privilege level; 0 = highest privilege, 3 = lowest.
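A minimal sketch of pulling those three fields out of a 16-bit selector (the struct and names are illustrative):

    #include <stdint.h>

    typedef struct {
        uint16_t index;  /* entry number in the segment table */
        uint8_t  ti;     /* 0 = global table, 1 = local table */
        uint8_t  rpl;    /* 0 = highest privilege, 3 = lowest */
    } selector_fields;

    selector_fields decode_selector(uint16_t sel)
    {
        selector_fields f;
        f.rpl   =  sel       & 0x3;   /* bits 1..0  */
        f.ti    = (sel >> 2) & 0x1;   /* bit  2     */
        f.index =  sel >> 3;          /* bits 15..3 */
        return f;
    }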

Page 115: 9.1.0 Branch Prediction Pentiums IBM PPC

115

ECE – 684 IA-32 Overview

•  IA-32 Overview –  Pentium 4 / Netburst µArchitecture –  SSE2

•  Hyper Pipeline –  Overview –  Branch Prediction

•  Execution Types –  Rapid Execution Engine –  Advanced Dynamic Execution

•  Memory Management –  Paging –  Virtual Memory –  Segmentation

•  Address Modes / Instruction Format –  Address Translation

•  Cache –  Levels of Cache (L1 & L2) / Execution Trace Cache –  Instruction Decoder –  System Bus

•  Register Files –  Enhanced Floating Point & Multi-Media Unit

•  Summary / Conclusion

Page 116: 9.1.0 Branch Prediction Pentiums IBM PPC

116

ECE – 684 Addressing Modes

- Determine the technique for offset generation

[Diagram: effective address generation. Effective Address (Offset) = Base Register + (Index Register x Scale) + Displacement, where Scale is 1, 2, 4, or 8 and the displacement in the instruction is 0, 8, or 32 bits. The descriptor registers supply the segment base address, access rights, and limit; Segment Base + Offset gives the linear address, which then goes through paging (invisible to the programmer) to reach main memory.]

Page 117: 9.1.0 Branch Prediction Pentiums IBM PPC

117

ECE – 684

Mode                                      Algorithm
Immediate                                 Operand = A
Register operand                          LA = R
Displacement                              LA = (SR) + A
Base                                      LA = (SR) + (B)
Base with displacement                    LA = (SR) + (B) + A
Scaled index with displacement            LA = (SR) + (I) x S + A
Base with index and displacement          LA = (SR) + (B) + (I) + A
Base with scaled index and displacement   LA = (SR) + (I) x S + (B) + A
Relative                                  LA = (PC) + A

where LA = linear address; (X) = contents of X; SR = segment register; PC = program counter; A = contents of an address field in the instruction; R = register; B = base register; I = index register; S = scaling factor.
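As a worked sketch of the richest mode in the table, base with scaled index and displacement, with arbitrary example register contents:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t segment_base = 0x00010000u;  /* (SR): from the descriptor */
        uint32_t base         = 0x00002000u;  /* (B): base register        */
        uint32_t index        = 5;            /* (I): index register       */
        uint32_t scale        = 4;            /* S: 1, 2, 4, or 8          */
        uint32_t disp         = 0x10u;        /* A: displacement           */

        uint32_t offset = base + index * scale + disp;  /* effective address */
        uint32_t linear = segment_base + offset;        /* before paging     */
        printf("offset=0x%08X linear=0x%08X\n",
               (unsigned)offset, (unsigned)linear);
        return 0;
    }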

Addressing Modes

Page 118: 9.1.0 Branch Prediction Pentiums IBM PPC

118

ECE – 684

[Diagram: example of the scaled index with displacement mode. Effective Address (Offset) = (Index Register x Scale) + Displacement; the descriptor registers supply the segment base address, access rights, and limit, and Segment Base + Offset yields the linear address.]

Page 119: 9.1.0 Branch Prediction Pentiums IBM PPC

119

ECE – 684

Instruction Format

Field:  Instruction Prefixes | Opcode | ModR/M | SIB    | Displacement  | Immediate
Bytes:  0 to 4               | 1 or 2 | 0 or 1 | 0 or 1 | 0, 1, 2, or 4 | 0, 1, 2, or 4

The up-to-four prefixes (each 0 or 1 byte) are: instruction prefix, operand-size override, address-size override, and segment override.

ModR/M byte (bits 7..0): Mod (7-6) | Reg/Opcode (5-3) | R/M (2-0)
SIB byte (bits 7..0): Scale (7-6) | Index (5-3) | Base (2-0)
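A minimal sketch of splitting the ModR/M and SIB bytes into those bit fields:

    #include <stdint.h>

    void decode_modrm_sib(uint8_t modrm, uint8_t sib,
                          uint8_t *mod, uint8_t *reg, uint8_t *rm,
                          uint8_t *scale, uint8_t *idx, uint8_t *base)
    {
        *mod   = (modrm >> 6) & 0x3;  /* bits 7..6                          */
        *reg   = (modrm >> 3) & 0x7;  /* bits 5..3: reg or opcode extension */
        *rm    =  modrm       & 0x7;  /* bits 2..0                          */
        *scale = (sib >> 6)   & 0x3;  /* scale factor applied is 1 << scale */
        *idx   = (sib >> 3)   & 0x7;
        *base  =  sib         & 0x7;
    }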

Page 120: 9.1.0 Branch Prediction Pentiums IBM PPC

120

ECE – 684 IA-32 Overview

•  IA-32 Overview –  Pentium 4 / Netburst µArchitecture –  SSE2

•  Hyper Pipeline –  Overview –  Branch Prediction

•  Execution Types –  Rapid Execution Engine –  Advanced Dynamic Execution

•  Memory Management –  Segmentation –  Paging –  Virtual Memory

•  Address Modes / Instruction Format –  Address Translation

•  Cache –  Levels of Cache (L1 & L2) / Execution Trace Cache –  Instruction Decoder –  System Bus

•  Register Files –  Enhanced Floating Point & Multi-Media Unit

•  Summary / Conclusion

Page 121: 9.1.0 Branch Prediction Pentiums IBM PPC

121

ECE – 684 Cache Organization

Physical Memory

System Bus (External)

Bus Interface Unit

L2 Cache

Instruction Decoder Trace Cache

Instruction TLBs

Data Cache Unit (L1)

Store Buffer

Data TLBs

Page 122: 9.1.0 Branch Prediction Pentiums IBM PPC

122

ECE – 684 IA-32 Overview

•  IA-32 Overview –  Pentium 4 / Netburst µArchitecture –  SSE2

•  Hyper Pipeline –  Overview –  Branch Prediction

•  Execution Types –  Rapid Execution Engine –  Advanced Dynamic Execution

•  Memory Management –  Segmentation –  Paging –  Virtual Memory

•  Address Modes / Instruction Format –  Address Translation

•  Cache –  Levels of Cache (L1 & L2) / Execution Trace Cache –  Instruction Decoder –  System Bus

•  Register Files –  Enhanced Floating Point & Multi-Media Unit

•  Summary / Conclusion

Page 123: 9.1.0 Branch Prediction Pentiums IBM PPC

123

ECE – 684 Enhanced FP & Multi-Media Unit

•  Expands Registers –  128-bit – Adds One Additional Register

•  Data Movement

•  Improves performance on applications – Floating Point – Multi-Media

Page 124: 9.1.0 Branch Prediction Pentiums IBM PPC

124

ECE – 684

When the name Prescott was first heard, people started to assume a Pentium 5 was coming, as a number of changes differentiate Prescott from the Northwood core: a 90 nm process, 1 MB of L2 cache rather than 512 KB, an L1 data cache doubled to 16 KB, 13 new instructions referred to as SSE3, and a pipeline extended from 20 to 31 stages, still officially part of Intel's NetBurst architecture.

P4 Prescott

Source: Intel's New Weapon: Pentium 4 Prescott, February 1, 2004, Patrick Schmid, Achim Roos, Bert Töpelt

Page 125: 9.1.0 Branch Prediction Pentiums IBM PPC

125

ECE – 684

It looks like millions of other Pentium 4 processors, but there's something new: 1 MB of L2 cache, 16 KB of L1 data cache, and SSE3, the fourth SIMD instruction set Intel has added to the Pentium family (after MMX, SSE, and SSE2).

Package

Page 126: 9.1.0 Branch Prediction Pentiums IBM PPC

126

ECE – 684

Intel's nomenclature is very simple: they basically just add an E after the clock speed number, e.g. Pentium 4 3.0E GHz. Besides the three versions reviewed in this article (2.8E / 3.0E / 3.2E GHz), Intel is also launching a low-cost Prescott version at 2.8A GHz with a 133 MHz FSB and without HyperThreading. That is of particular importance because the TDP (thermal design power) has reached a new record: 103 watts for the 3.4E and 3.2E GHz versions. Even more interesting is the TDP of the new P4 Extreme Edition at 3.4 GHz: 102.9 watts.

Numbering - thermal

Page 127: 9.1.0 Branch Prediction Pentiums IBM PPC

127

ECE – 684

With the advantage of the very small 90 nm production process, Intel was easily able to increase the L2 cache size: instead of Northwood's 512 KB, Prescott can now access 1 MB. Despite the higher transistor count, the die size dropped from 127 mm² to 112 mm². At 3.4E GHz, Prescott has a maximum cache bandwidth of 108 GB/s. Additionally, Intel doubled the L1 data cache from 8 KB to 16 KB. Look back to 2000, when Intel launched the Pentium 4 Willamette with an L1 cache reduced to 8 KB: back then, the reduction was needed to keep the latency at two clock cycles, since slower cache access would have worsened the performance gap with the Pentium III even more. It is still very important today to have fast caches, since both AGUs (address generation units) need to access the L1 data cache frequently.

cache

Page 128: 9.1.0 Branch Prediction Pentiums IBM PPC

128

ECE – 684

After Intel's success with the Pentium 4's SSE2 instruction set (Streaming SIMD Extensions, 144 instructions), SSE3 is supposed to be a reaction to the wishes and desires of big software companies. This time, there are only 13 new instructions to make the programmer's life easier:

• fisttp: FP-to-int conversion
• addsubps, addsubpd, movsldup, movshdup, movddup: complex arithmetic
• lddqu: video encoding
• haddps, hsubps, haddpd, hsubpd: graphics (SIMD FP / AOS)
• monitor, mwait: thread synchronization
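One of these, haddps (horizontal add), is reachable from C through the SSE3 intrinsic _mm_hadd_ps; a minimal sketch (compile with SSE3 enabled, e.g. -msse3):

    #include <pmmintrin.h>   /* SSE3 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); /* lanes {1,2,3,4}      */
        __m128 r = _mm_hadd_ps(a, a);                  /* {1+2, 3+4, 1+2, 3+4} */
        float out[4];
        _mm_storeu_ps(out, r);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }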

SSE

Page 129: 9.1.0 Branch Prediction Pentiums IBM PPC

129

ECE – 684

NetBurst Architecture: Now 31 Pipeline Stages

Architecture

Page 130: 9.1.0 Branch Prediction Pentiums IBM PPC

130

ECE – 684 P4 Pipeline

Instructions are received over the 64-bit wide, 200 MHz quad-pumped system bus (6.4 GB/s). They then enter the L2 cache. The prefetcher analyzes the instructions and activates the BTB (Branch Target Buffer) in order to get a branch prediction, determining what data could be required next. The instruction stream is sent through the instruction decoder, which translates the x86 instructions into micro-operations (µops). x86 instructions can be complex and frequently feature loops, which is why Intel abandoned the classic L1 instruction cache back with the first Pentium 4 Willamette in favor of the Execution Trace Cache. It is based on micro-operations and is located behind the instruction decoder, making it the much smarter solution by eliminating unnecessary decoding work. The Execution Trace Cache stores and reorganizes chains of multiple micro-operations in order to pass them to the Rapid Execution Engine in an efficient manner.

Page 131: 9.1.0 Branch Prediction Pentiums IBM PPC

131

ECE – 684

If the BTB does not provide a branch prediction, the Instruction Decoder will perform a static prediction, which is supposed to have only a small impact on performance should the prediction turn out wrong; an improved loop detection process helps keep that impact small. The dynamic branch prediction has also been updated, and integer multiplication is now done within a dedicated unit. Predicting branches is a core element in enabling high performance: if the processor knows, or at least guesses, what comes next, it can fill its pipeline efficiently. This has become even more important since the pipeline has been stretched from 20 stages to 31. Intel tries to reduce the complexity of each stage in order to reach higher clock speeds; in exchange, the processor becomes more vulnerable to misprediction. Now it is quite obvious why Intel increased all the caches: in case of misprediction, it is more important than ever to keep the system running, so the right data must be available to refill the pipeline. To support that, the L1 data cache is eight-way set associative, which helps determine whether the requested data is already located inside the cache.

Prescott Branch Prediction

Page 132: 9.1.0 Branch Prediction Pentiums IBM PPC

132

ECE – 684 Wafer & die size

Page 133: 9.1.0 Branch Prediction Pentiums IBM PPC

133

ECE – 684 Wafer & die size

In contrast to the 200 mm wafers AMD uses, Intel's pizza-pie-sized 300 mm wafers offer much more space. We have analyzed the theoretical number of processors on each of those wafers in order to talk about availability, prices, and finally the success of a processor (see above). It is either delightful or depressing (depending on your personal view) to see how many processors can be made from one single wafer. The theoretical limit is 588 Prescott processors on Intel's 300 mm wafers and 148 Opteron/Athlon64 FX CPUs on AMD's 200 mm models. Even if Intel yielded only 40%, it would still gain more than double the number of processors that AMD would with a 60% yield (0.40 x 588 ≈ 235 dies versus 0.60 x 148 ≈ 89). Still, you should not forget that Intel usually has to supply larger customers than AMD, and has the fab capacity to do so. In a wafer fab, 85% yields are definitely possible and are hit from time to time, but in mass-production facilities even a 70% yield is considered sufficiently high. When a facility starts producing a new product, yield rates are usually tremendously lower until the process ramps up to mass-scale volumes.

Page 134: 9.1.0 Branch Prediction Pentiums IBM PPC

134

ECE – 684 Multicore - Nehalem

Here you can see a die shot of the new Nehalem processor - in this iteration a four core design with two separate QPI links and large L3 cache in relation to the rest of the chip.

Page 135: 9.1.0 Branch Prediction Pentiums IBM PPC

135

ECE – 684

Intel can easily create a range of processors from 1 core to 8 cores depending on the application and market demands. Eight-core CPUs will be found in servers, while you'll find dual-core machines in the mobile market several months after the initial desktop introduction. The SSE instructions get a bump to revision 4.2, branch prediction and prefetch algorithms improve, and simultaneous multi-threading (SMT) makes a return after a hiatus following the NetBurst architecture.

Page 136: 9.1.0 Branch Prediction Pentiums IBM PPC

136

ECE – 684 HyperThreading Returns

SMT (simultaneous multi-threading), or HyperThreading, is also key to keeping the 4-wide execution engine fed with work and tasks to complete. With the larger caches and much higher memory bandwidth that the chip provides, this is a very important addition.

Page 137: 9.1.0 Branch Prediction Pentiums IBM PPC

137

ECE – 684

Page 138: 9.1.0 Branch Prediction Pentiums IBM PPC

138

ECE – 684

Page 139: 9.1.0 Branch Prediction Pentiums IBM PPC

139

ECE – 684 Power control

The Nehalem core also has a new trick in its bag that enables it to lower the power consumption of a core to nearly 0 watts - something that wasn't possible on previous designs. You can see in the image above what the total power consumption of a core was typically made up of in the Core 2 series of processors: clocks and logic are the majority of it, but a third or more is related to transistor leakage, which could not be turned off in prior designs.

With the independent power controller in the PCU and the separate power planes that each core rests on, the power consumption of each core is completely independent of the others. You can see in this diagram that although Core 3 is loaded the entire time, both Core 2 and Core 0 are able to power down to practically 0 watts when their workload is complete.

Page 140: 9.1.0 Branch Prediction Pentiums IBM PPC

140

ECE – 684

IBM Power PC (Power 4): p690 Architecture

Page 141: 9.1.0 Branch Prediction Pentiums IBM PPC

141

ECE – 684

• Power4 CPUs • Caches • Memory • Prefetching

Overview

Page 142: 9.1.0 Branch Prediction Pentiums IBM PPC

142

ECE – 684

• Basic features
  – 1.3 GHz clock speed
  – two independent floating-point units
  – single instruction for floating-point multiply-add (FMA)
  – theoretical peak is therefore 5.2 GFlops per CPU (1.3 GHz x 2 FPUs x 2 flops per FMA)
• Many typical features of modern RISC processors
• Difficult to attain a high percentage of peak performance
  – dense linear algebra is an exception
  – "good" applications realise 10-20% of peak
    • easy to get much less than this!

IBM Power4 CPU

Page 143: 9.1.0 Branch Prediction Pentiums IBM PPC

143

ECE – 684

• Superscalar processor
  – capable of issuing up to 5 instructions per clock cycle
  – 2 FP, 2 integer, 2 load/store, 1 branch, 1 logical
• Two integer addition/logical units
• Two floating-point units
  – single instruction for multiply-add
  – non-pipelined divide and square root
• 80 integer, 72 FP registers
  – only 32 virtual registers in the instruction set
  – hardware maps virtual registers to physical ones on the fly

Power4 processors

Page 144: 9.1.0 Branch Prediction Pentiums IBM PPC

144

ECE – 684

• Long pipeline
  – up to 20 cycles for each instruction from start to finish
  – an FMA takes 6 cycles from reading registers to delivering the result back to registers
  – not enough virtual registers to keep both FPUs busy all the time
    • not even Linpack approaches 100% of peak
• Out-of-order execution
  – hardware can reorder instructions to make best use of the hardware resources
  – requires a great deal of internal bookkeeping!

Other features

Page 145: 9.1.0 Branch Prediction Pentiums IBM PPC

145

ECE – 684

• Branch prediction
  – lots of hardware to try to predict branches
  – mispredicted branches cause the pipeline to stall
  – 16-Kbit local and global branch predictor tables
  – overkill for scientific codes: most branches are back to the start of a loop
• Speculative execution
  – can issue instructions ahead of branches
  – instructions are killed if they are not required
  – keeps the pipeline full

Other features (continued)

Page 146: 9.1.0 Branch Prediction Pentiums IBM PPC

146

ECE – 684

• Caches rely on temporal and spatial locality
• Caches are divided into lines (a.k.a. blocks)
• Lines are organized as sets
• A memory location is mapped to a set depending on its address
• It can occupy any line within that set
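A minimal sketch of that address-to-set mapping (the 128-byte line size matches the Power4 caches described later; the set count is an illustrative assumption):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 128u   /* bytes per line               */
    #define NUM_SETS  128u   /* hypothetical; a power of two */

    int main(void)
    {
        uint64_t addr = 0x0003F280u;        /* example address         */
        uint64_t line = addr / LINE_SIZE;   /* which line-sized block  */
        uint64_t set  = line % NUM_SETS;    /* which set it may occupy */
        printf("address 0x%llX -> set %llu\n",
               (unsigned long long)addr, (unsigned long long)set);
        return 0;
    }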

Caches

Page 147: 9.1.0 Branch Prediction Pentiums IBM PPC

147

ECE – 684

• A cache with 1 line per set is called direct mapped
• A cache with k lines per set is called k-way set associative
• A cache with only 1 set is called fully associative

Cache terminology

Page 148: 9.1.0 Branch Prediction Pentiums IBM PPC

148

ECE – 684

• When a line is loaded into the cache, its address determines which set it goes into
• In a direct-mapped cache, it simply replaces the only line in the set
• In a k-way set-associative cache, there are k lines which could be ejected to make room for the new one
  – the usual policy is to replace the least recently used (LRU) line
  – better than random, but not always optimal: the LRU line may still be the one required next!

Replacement policy

Page 149: 9.1.0 Branch Prediction Pentiums IBM PPC

149

ECE – 684

• Caches may be:
  – write-through: data is written to the cache line and to the lower memory level
  – write-back: data is only written to the cache; lower levels are updated when the cache line is replaced
• Caches may also be:
  – write allocate: if the write location is not in the cache, the enclosing line is loaded into the cache (usual for write-back)
  – no write allocate: if the write location is not in the cache, only the underlying level is modified (usual for write-through)

Caches and writes

Page 150: 9.1.0 Branch Prediction Pentiums IBM PPC

150

ECE – 684

• p690 has 3 levels of cache
  – separate L1 data and instruction caches
  – unified L2 shared between the 2 CPUs on a chip
  – global L3 cache (more of a memory buffer)

p690 Memory System

Page 151: 9.1.0 Branch Prediction Pentiums IBM PPC

151

ECE – 684

• Instruction cache
  – 64 Kbytes, direct mapped
  – 128-byte lines
• Data cache
  – 32 Kbytes
  – 2-way set associative
  – LRU replacement
  – 128-byte lines
  – write-through, no write allocate
  – 2 x 8-byte reads and 1 x 8-byte write per cycle
  – 4-5 cycle latency

L1 caches

Page 152: 9.1.0 Branch Prediction Pentiums IBM PPC

152

ECE – 684

• A single chip comprises
  – 2 independent CPUs
  – a shared L2 cache

Power4 Chip

Page 153: 9.1.0 Branch Prediction Pentiums IBM PPC

153

ECE – 684

• 1440 Kbytes, unified (data + instructions)
• 8-way set associative
• Shared by both CPUs on the chip
  – effectively each processor has 720 Kbytes of cache
• 128-byte lines
  – write-through, write allocate
  – loads in 32-byte chunks
• 14-20 cycle latency, L2 -> registers
• The cache has 3 independent sections of 480 Kbytes each
  – lines within the 1440-Kbyte unit are hashed to sections (consecutive lines never go to the same section)

L2 cache

Page 154: 9.1.0 Branch Prediction Pentiums IBM PPC

154

ECE – 684 Power4 Chip

Page 155: 9.1.0 Branch Prediction Pentiums IBM PPC

155

ECE – 684

• Chips are packaged up in groups of four
  – each Multi-Chip Module (MCM) has eight CPUs
  – all sharing the same L3 cache

Multi-Chip Modules

Page 156: 9.1.0 Branch Prediction Pentiums IBM PPC

156

ECE – 684

• Really a memory buffer rather than a cache
• 128 Mbytes per MCM (4 chips, 8 CPUs)
• 8-way set associative
• 512-byte lines
• approx. 100-cycle latency
• Usually only caches memory locations attached to the MCM
• Shared by all CPUs
  – single-CPU jobs get access to ALL the L3 cache in the system
• Does not allocate if already "busy"

L3 cache

Page 157: 9.1.0 Branch Prediction Pentiums IBM PPC

157

ECE – 684

• 8 Gbytes of main memory per MCM
  – 1 Gbyte per processor
• Accessible by all CPUs
• 350-400 cycles latency from main memory to registers
• Running one CPU on an MCM, a memory bandwidth of around 2.5 Gbyte/s is observed
• However, when running all 8 CPUs the aggregate bandwidth is around 8 Gbyte/s
  – poor scaling, or good single-CPU performance?
  – beware of single-CPU benchmarking

Main memory

Page 158: 9.1.0 Branch Prediction Pentiums IBM PPC

158

ECE – 684

• Translation lookaside buffer
  – the processor works on effective addresses
  – memory works on real addresses
  – the TLB is a cache for the effective-to-real mapping
• 1024 entries, 4-way set associative
  – each entry corresponds to a page (4 Kbytes)
  – the whole TLB therefore addresses 4 Mbytes (1024 x 4 Kbytes), larger than the L2 cache

TLB

Page 159: 9.1.0 Branch Prediction Pentiums IBM PPC

159

ECE – 684

• Four MCMs make up a p690 frame
  – also called Regatta H
  – 32 CPUs and 32 Gbytes of memory per frame
  – peak of 166.4 Gflops (32 x 5.2 Gflops)
• Each frame is configured as 4 machines
  – called Logical PARtitions (LPARs)
  – each LPAR maps to one MCM

Larger Shared-Memory Nodes

Page 160: 9.1.0 Branch Prediction Pentiums IBM PPC

160

ECE – 684

• LPARs are almost completely independent
  – they run separate operating systems
  – they cannot access memory on a different LPAR
• The 4 MCMs in a frame are connected by multiple busses
  – some cross-LPAR traffic does occur
  – cache coherency mechanisms cannot be turned off
• Single-LPAR performance can be impacted by jobs running on other LPARs in the same frame
  – can be on the order of 10% in the worst case
  – not drastic, but noticeable on some benchmarks

Larger Shared-Memory Nodes – (continued)

Page 161: 9.1.0 Branch Prediction Pentiums IBM PPC

161

ECE – 684

• The p690 has a hardware prefetch capability
  – helps to hide the long latencies
  – makes use of the available memory bandwidth
• Simple algorithm for guessing which cache lines will be required in the near future
  – fetch them before they are requested
• The prefetch engine monitors loads to cache lines
  – it detects accesses to consecutive cache lines (128 bytes), in either ascending or descending order in memory
  – two consecutive accesses trigger a prefetch stream

Hardware prefetching

Page 162: 9.1.0 Branch Prediction Pentiums IBM PPC

162

ECE – 684

• Accesses to subsequent consecutive cache lines cause data to be fetched into the different caches
  – the next line in sequence is fetched into the L1 cache
  – the line 5 ahead is fetched into the L2 cache
  – the lines 17, 18, 19 & 20 ahead (512 bytes) are fetched into the L3 cache
• The distance ahead is long enough to hide the memory latency
• Up to 8 streams can be active at the same time
• A stream stops when a page boundary is crossed
  – every 4 Kbytes, unless large pages are enabled
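A simple ascending traversal like the sketch below is exactly the pattern the prefetch engine rewards: after the first two consecutive cache lines are touched, a stream is triggered and later lines arrive in L1/L2/L3 ahead of the loop (until a page boundary is crossed).

    #include <stddef.h>

    double sum_sequential(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)   /* consecutive, ascending accesses */
            sum += a[i];
        return sum;
    }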

Hardware prefetching – (continued)

Page 163: 9.1.0 Branch Prediction Pentiums IBM PPC

163

ECE – 684

• The Power4 Processor Introduction and Tuning Guide http://www.redbooks.ibm.com/redbooks/SG247041.html

Where to find out more

Newest Supercomputer