PC Processor Microarchitecture

TABLE OF CONTENTS

• Introduction

• Building a Framework for Comparison

• What Does a Computer Really Do?

• The Memory Subsystem

• Exploiting ILP Through Pipelining

• Exploiting ILP Via Superscalar Processing

• Exploiting Data-Level Parallelism Via SIMD

• Where Should Designers Focus The Effort?

• A Closer Look At Branch Prediction

• Speculative, Out-of-Order Execution Gets a New Name

• Analyzing Some Real Microprocessors: P4

• Pentium 4's Cache Organization

• Pentium 4's Trace Cache

• The Execution Engine Runs Out Of Order

• AMD Athlon Microarchitecture

• AMD Athlon Scheduler, Data Access

• Centaur C3 Microarchitecture

• Overall Conclusions

• List of References

Introduction

Isn't it interesting that new high-tech products seem so complicated, yet only a few years later we talk about how much

simpler the old stuff was? This is certainly true for microprocessors. As soon as we finally figure out all the new features

and feel comfortable giving advice to our family and friends, we're confronted with details about a brand-new processor

that promises to obsolete our expertise on the "old" generation. Gone are the simple and familiar diagrams of the past,

replaced by arcane drawings and cryptic buzzwords. For a PC technology enthusiast, this is like discovering a new world

to be explored and conquered. While many areas will seem strange and unusual, much of the landscape resembles

places we've traveled before. This article is meant to serve as a faithful companion for this journey, providing a guidebook

of the many wondrous new discoveries we're sure to encounter.

An Objective Tutorial and Analysis of PC Microarchitecture

The goal of this article is to give the reader some tools for understanding the internals of modern PC microprocessors. In

the article "PC Motherboard Technology", we developed some tools for analyzing a modern PC motherboard. This article

takes us one step deeper, zooming into the complex world inside the PC processor itself. The internal design of a

processor is called the "microarchitecture". Each CPU vendor uses slightly different techniques for getting the most out of

their design, while meeting their unique performance, power, and cost goals. The marketing departments from these

companies will often highlight microarchitectural features when promoting their newest CPUs, but it's often difficult for us

PC technology enthusiasts to figure out what it really means.

What is needed is an objective comparison of the design features for all the CPU vendors, and that's the goal of this

article. We'll walk through the features of the latest x86 32-bit desktop CPUs from Intel, AMD, and VIA (Centaur). Since

the Transmeta "Crusoe" processor is mostly targeted at the mobile market, we'll analyze their microarchitecture in another

article. It will also be the task for another article to thoroughly explore Apple's PowerPC G4 microprocessor, and many of

the analytical tools learned here will apply to all high-end processors.

Building a Framework for Comparison

Before we can dive right into the block diagram of a modern CPU, we need to develop some analytical tools for

understanding how these features affect the operation of the PC system. We also need to develop a common framework

for comparison. As you'll soon see, that is no easy task. There are some radical differences in architecture between these

vendors, and it's difficult to make direct comparisons. As it turns out, the best way to understand and compare these new

CPUs is to go back to basic computer architectural concepts and show how each vendor has solved the common

problems faced in modern computer design. In our last section, we'll gaze into the future of PC microarchitecture and

make a few predictions.

Let's Not Lose Sight of What Really Matters

There is one issue that should be stressed right up front. We should never lose sight of the real objective in computer

design. All that really matters is how well the CPU helps the PC run your software. A PC is a computer system, and subtle

differences in CPU microarchitecture may not be noticeable when you're running your favorite computer program. We

learned this in our article on motherboard technology, since a well-balanced PC needs to remove all the bottlenecks (and

meet the cost goals of the user). The CPU designers are turning to more and more elaborate techniques to squeeze extra

performance out of these machines, so it's still really interesting to peek in on the raging battle for even a few percent

better system performance.

For a PC technology enthusiast, it's just downright fascinating how these CPU architects mix clever engineering tricks

with brute-force design techniques to take advantage of the enormous number of transistors available on the latest chips.

What Does a Computer Really Do?

It's easy to get buried too deeply in the complexities of these modern machines, but to really understand the design

choices, let's think again about the fundamental operation of a computer. A computer is nothing more than a machine that

reads a coded instruction, decodes the instruction, and executes it. If the instruction needs to load or store some data, the

computer figures out the location for the data and moves it. That's it; that's all a computer does. We can break this

operation into a series of stages:

The 5 Computer Operation Stages

Stage 1 Instruction Access (IA)

Stage 2 Instruction Decode (ID)

Stage 3 Execution (EX)

Stage 4 Data Access (DA)

Stage 5 Store (write back) Results (WB)

Some computer architects may re-arrange, combine, or break up the stages, but every computer microarchitecture does

these five things. We can use this framework to build on as we work our way up to even the most complicated CPUs.

For those of you who eat this stuff for breakfast and are anxious to jump ahead, remember that we haven't yet talked

about pipelines. These stages could all be completely processed for a single instruction before starting the next one. If

you think about that idea for a moment, you'll realize that almost all the complexity comes when we start improving on that

limitation. Don't worry; the discussion will quickly ramp up in complexity, and some readers might appreciate a quick

refresher. Let's see what happens in each of these stages:

Instruction Access

A coded instruction is read from the memory subsystem at an address that is determined by a program counter (PC). In

our analysis, we'll treat memory as something that hangs off to the side of our CPU "execution core", as we show in the

figure below. Some architects like to view memory and the system bus as an integral part of the microarchitecture, and

we'll show how the memory subsystem interacts with the rest of the machine.

Instruction Decode

The coded instruction is converted into control information for the logic circuits of the machine. Each "operation code (Opcode)" represents a different instruction and causes the machine to behave in different ways. Embedded in the Opcode (or stored in later bytes of the instruction) can be address information or "immediate" data to be processed. The address information can represent a new address that might need to be loaded into the

PC (a branch address) or the address can represent a memory location for data (loads and stores). If the instruction

needs data from a register, it is usually brought in during this stage.

Execute

This is the stage where the machine does whatever operation was directed by the instruction. This could be a math

operation (multiply, add, etc.) or it might be a data movement operation. If the instruction deals with data in memory, the

processor must calculate an "Effective Address (EA)". This is the actual location of the data in the memory subsystem

(ignoring virtual memory issues for now), based on calculating address offsets or resolving indirect memory references (A

simple example of indirection would be registers that house an address, rather than data).
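
For readers who like to see the arithmetic, here is a minimal C sketch of the classic x86-style Effective Address calculation (base + index * scale + displacement). The struct and function names are ours, purely for illustration, not anything from a real decoder.

```c
#include <stdint.h>

/* Hypothetical decoded memory operand: EA = base + index*scale + displacement. */
typedef struct {
    uint32_t base;    /* value of the base register, e.g. EBX         */
    uint32_t index;   /* value of the index register, e.g. ESI        */
    uint32_t scale;   /* 1, 2, 4, or 8                                */
    int32_t  disp;    /* signed displacement encoded after the opcode */
} mem_operand_t;

static uint32_t effective_address(const mem_operand_t *op)
{
    /* Indirection: the registers hold an address, not the data itself. */
    return op->base + op->index * op->scale + (uint32_t)op->disp;
}
```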

Data Access

In this stage, instructions that need data from memory will present the Effective Address to the memory subsystem and

receive back the data. If the instruction was a store, then the data will be saved in memory. Our simple model for

comparison gets a bit frayed in this stage, and we'll explain in a moment what we mean.

Write Back

Once the processor has executed the instruction, perhaps having been forced to wait for a data load to complete, any

new data is written back to the destination register (if the instruction type requires it).

Was There a Question From the Back of the Room?

Some of the x86 experts in the audience are going to point out the numerous special cases for the way a processor must

deal with an instruction set designed in the 1970s. Our five-stage model isn't so simple when it must deal with all the

addressing modes of an x86. A big issue is the fact that the x86 is what is called a "register-memory" architecture where

even ALU (Arithmetic Logic Unit) instructions can access memory. This is contrasted with RISC (Reduced Instruction Set

Computing) architectures that only allow Load and Store instructions to move data (register-register or more commonly

called Load/Store architectures).

The reason we can focus on the Load/Store architecture to describe what happens in each stage of a computer is that

modern x86 processors translate their native CISC (Complex Instruction Set Computing) instructions into RISC

instructions (with some exceptions). By translating the instructions, most of the special cases are turned into extra RISC

instructions and can be more efficiently processed. RISC instructions are much easier for the hardware to optimize and

run at higher clock rates. This internal translation to RISC is one of the ways that x86 processors were able to deal with

the threat that higher-performance RISC chips would take over the desktop in the early 1990s. We'll talk about instruction

translation more when we dig into the details of some specific processors, at which point we'll also show several ways in

which our model is dramatically modified.

To the questioner in the back of the room, there will be several things we're going to have to gloss over (and simplify) in

order to keep this article from getting as long as a computer textbook. If you really want to dig into details, check out the

list of references at the end of this article.

The Memory Subsystem

The memory subsystem plays a big part in the microarchitecture of a CPU. Notice that both the Instruction Access stage

and the Data Access stage of our simple processor must get to memory. This memory can be split into separate sections

for instructions and data, allowing each stage to have a dedicated (hence faster) port to memory.

This is called a "Harvard Architecture", a term from work at Harvard University in the 1940s that has been extended to

also refer to architectures with separate instruction and data caches--even though main memory (and sometimes L2

cache) is "unified". For some background on cache design, you can refer to the memory hierarchy discussion in the

article, "PC Motherboard Technology". That article also covers the system bus interface, an important part of the PC CPU

design that is tailored to support the internal microarchitecture.

Virtual Memory: Making Life Easier for the Programmer and Tougher for the Hardware Designer

To make life simpler for the programmer, most addresses are "virtual addresses" that allow the software designer to

pretend to have a large, linear block of memory. These virtual addresses are translated into "physical addresses" that

refer to the actual addresses of the memory in the computer. In almost all x86 chips, the caches contain memory data that

is addressed with physical addresses. Before the cache is accessed, any virtual addresses are translated in a

"Translation Look-aside Buffer (TLB)". A TLB is like a cache of recently-used virtual address blocks (pages), responding

back with the physical address page that corresponds to the virtual address presented by the CPU core. If the virtual

address isn't in one of the pages stored by the TLB (a TLB miss), then the TLB must be updated from a bigger table

stored in main memory--a huge performance hit (especially if the page isn't in main memory and must be loaded from

disk). Some CPUs have multiple levels of TLBs, similar to the notion of cache memory hierarchy. The size and structure

of the TLBs and caches will be important during our CPU comparisons later, but we'll focus mainly on the CPU core for

our analysis.
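
As a rough sketch of what a TLB lookup amounts to, consider the following C fragment. The size, the direct-mapped organization, and all names are illustrative assumptions, not any particular CPU's design.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12         /* 4 KB pages        */
#define TLB_ENTRIES 64         /* illustrative size */

typedef struct {
    bool     valid;
    uint32_t vpn;              /* virtual page number   */
    uint32_t pfn;              /* physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and fills *paddr; a miss means walking the
 * page tables in main memory -- the "huge performance hit" noted above. */
static bool tlb_translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];   /* direct-mapped for simplicity */

    if (e->valid && e->vpn == vpn) {
        *paddr = (e->pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
        return true;                            /* hit */
    }
    return false;                               /* miss: page-table walk needed */
}
```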

Exploiting ILP Through Pipelining

Instead of waiting until an instruction has completed all five stages of our model machine, we could start a new instruction

as soon as the first instruction has cleared stage 1. Notice that we can now have five instructions progressing through our

"pipeline" at the same time. Essentially, we're processing five instructions in parallel, referred to as "Instruction-Level

Parallelism (ILP)". If it took five clock cycles to completely execute an instruction before we pipelined the machine, we're

now able to execute a new instruction every single clock. We made our computer five times faster, just with this "simple"

change.
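
A quick back-of-the-envelope calculation shows where that factor of five comes from, assuming an ideal pipeline with no stalls -- exactly the assumption we question next. The numbers are illustrative only.

```c
#include <stdio.h>

int main(void)
{
    const long n = 1000000;      /* instructions   */
    const int  stages = 5;       /* pipeline depth */

    long unpipelined = n * stages;       /* each instruction finishes before the next starts */
    long pipelined   = n + (stages - 1); /* fill the pipe once, then one result per clock    */

    printf("speedup = %.2f\n", (double)unpipelined / (double)pipelined);  /* ~5.00 */
    return 0;
}
```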

Let's Just Think About This a Minute

We'll use a bunch of computer engineering terms in a moment, since we've got to keep that person in the back of the room happy. Before doing that, take a step back and think about what we did to the machine. (Even experienced engineers forget to do that sometimes.) Suddenly, memory fetches have to occur five times faster than before. This implies that the memory system and caches must now run five times as fast, even though each instruction still takes five cycles to completely execute.

We've also made a huge assumption that each stage was taking exactly the same amount of time, since that's the rule

that our pipeline clock is enforcing. What about the assumption that the processor was even going to run the next four

instructions in that order? We (usually) won't even know until the execute stage whether we need to branch to some other

instruction address. Hey, what would happen if the sequence of instructions called for the processor to load some data

from memory and then try to perform a math operation using that data in the next instruction? The math operation would

likely be delayed, due to memory latency slowing down the process.

They're Called Pipeline Hazards

What we're describing are called "pipeline hazards", and their effects can get really ugly. There are three types of hazards

that can cause our pipeline to come to a screeching halt--or cause nasty errors if we don't put in extra hardware to detect

them. The first hazard is a "data hazard", such as the problem of trying to use data before it's available (a "data

dependency"). Another type is a "control hazard" where the pipeline contains instructions that come after a branch. A

"structural hazard" is caused by resource conflicts where an instruction sequence can cause multiple instructions to need

the same processor resource during a given clock cycle. We'd have a structural hazard if we tried to use the same

memory port for both instructions and data.

Modern Pipelines Can Have a Lot of Stall Cycles

There are ways to reduce the chances of a pipeline hazard occurring, and we'll discuss some of the ways that CPU

architects deal with the various cases. In a practical sense, there will always be some hazards that will cause the pipeline

to stall. One way to describe the situation is to say that an instruction will "block" part of the pipe (something modern

implementations help minimize). When the pipe stalls, every (blocked) instruction behind the stalled stage will have to

wait, while the instructions fetched earlier can continue on their way. This opens up a gap (a "pipeline bubble") between

blocked instructions and the instructions proceeding down the pipeline in front of the blocked instructions.

When the blocked instruction restarts, the bubble will continue down the pipeline. For some hazards, like the control

hazard caused by a (mispredicted) branch instruction, the following instructions in the pipeline need to be killed, since

they aren't supposed to execute. If the branch target address isn't in the instruction cache, the pipeline can stall for a large

number of clock cycles. The stall would be extended by the latency of accesses to the L2 cache or, worse, accesses to

main memory. Stalls due to branches are a serious problem, and this is one of the two major areas where designers have

focused their energy (and transistor budget). The other major area, not surprisingly, is when the pipeline goes to memory

to load data. Most of our analysis will focus in on these 2 latency-induced problems.

Design Tricks To Reduce Data Hazards

For some data hazards, one commonly-used solution is to forward result data from a completed instruction straight to another instruction yet to execute in the pipeline (data "forwarding", though sometimes called "bypassing"). This is much faster than writing out the data and forcing the other instruction to read it back in. Our case of a math operation needing data from a previous memory load instruction would seem to be a good candidate for this technique. The data loaded from memory into a register can also be forwarded straight to the ALU execute stage, instead of going all the way through the register write-back stage. An instruction in the write-back stage could forward data straight to an instruction in the execute stage.

Why wait 2 cycles? Why not forward straight from the data access stage? In reality, the data load stage is far from

instantaneous and suffers from the same memory latency risk as instruction fetches. The figure below shows how this can

occur. What if the data is not in the cache? There would be a huge pipeline bubble. As it turns out, data access is even

more challenging than an instruction fetch, since we don't know the memory address until we've calculated the Effective

Address. While instructions are usually accessed sequentially, allowing several cache lines to be prefetched from the

instruction cache (and main memory) into a fast local buffer near the execution core, data accesses don't always have

such nice "locality of reference".
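
The forwarding idea described above boils down to comparing register numbers between pipeline stages and picking the freshest value. Here is a minimal C sketch under our five-stage model; the structure and field names are our own, not any vendor's.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool    writes_reg;   /* does this instruction produce a register result? */
    uint8_t dest;         /* destination register number                      */
    int32_t result;       /* value computed (or loaded) so far                */
} pipe_stage_t;

/* Pick the freshest copy of register 'src' for an instruction about to execute.
 * The Data Access (DA) stage holds the younger instruction, so it wins over WB. */
static int32_t forward_operand(uint8_t src, int32_t regfile_value,
                               const pipe_stage_t *da, const pipe_stage_t *wb)
{
    if (da->writes_reg && da->dest == src)
        return da->result;        /* bypass from the data-access stage */
    if (wb->writes_reg && wb->dest == src)
        return wb->result;        /* bypass from the write-back stage  */
    return regfile_value;         /* no hazard: read the register file */
}
```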

The Limits of Pipelining

If five stages made us run up to five times faster, why not chop up the work into a bunch more stages? Who cares about

pipeline hazards when it gives the marketing folks some really high peak performance numbers to brag about? Well,

every x86 processor we'll analyze has a lot more than five stages. Originally called "super-pipelining" until Intel (for no

obvious reason) decided to rename it "hyper-pipelining" in their Pentium 4 design, this technique breaks up various

processing stages into multiple clock cycles.

This also has the architectural benefit of giving better granularity to operations, so there should be fewer cases where a

fast operation waits around while slow operations throttle the clock rate. With some of the clever design techniques we'll

examine, the pipeline hazards can be managed, and clock rates can be cranked into the stratosphere. The real limit isn't

an architectural issue, but is related to the way digital circuits clock data between pipeline stages.

To pipeline an operation, each new stage of the pipeline must store information passed to it from a prior stage, since each

stage will (usually) contain information for a different instruction. This staged data is held in a storage device (usually a

"latch"). As you chop up a task into smaller and smaller pipeline stages, the overhead time it takes to clock data into the

latch ("set-up and hold" times and allowance for clock "skew" between circuits) becomes a significant percentage of the

entire clock period. At some point, there is no time left in the clock cycle to do any real work. There are some exotic circuit

tricks that can help, but it would burn a lot of power - not a good trade-off for chips that already exceed 70 watts in some

cases.
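
The latch-overhead argument can be captured with one line of arithmetic: each stage gets (total logic time / number of stages) plus a fixed clocking overhead, so the overhead eventually dominates. A small sketch with made-up numbers, purely to illustrate the trend:

```c
#include <stdio.h>

int main(void)
{
    const double total_logic_ns    = 10.0;  /* work for one instruction, unpipelined  */
    const double latch_overhead_ns = 0.15;  /* set-up/hold plus clock skew, per stage */

    for (int stages = 5; stages <= 40; stages += 5) {
        double cycle_ns = total_logic_ns / stages + latch_overhead_ns;
        printf("%2d stages: %.3f ns/cycle -> %.2f GHz\n",
               stages, cycle_ns, 1.0 / cycle_ns);
    }
    /* Frequency keeps rising, but each extra stage buys less and less,
     * because the fixed latch overhead never shrinks.                  */
    return 0;
}
```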

Exploiting ILP Via Superscalar Processing

While our simple machine doesn't have any serious structural hazards, that's only because it is a "single-issue"

architecture. Only a single instruction can be executed during a clock cycle. In a "superscalar" architecture, extra compute

resources are added to achieve another dimension of instruction-level parallelism. The original Pentium provided 2

separate pipelines that Intel called the U and V pipelines. In theory, each pipeline could be working simultaneously on 2

different sets of instructions.

With a multi-issue processor (where multiple instructions can be dispatched each clock cycle to multiple pipelines in the

single processor), we can have even more data hazards, since an operation in one pipeline could depend on data that is

in another pipeline. The control hazards can get worse, since our "instruction fetch bandwidth" rises (doubled in a 2-issue

machine, for example). A (mispredicted) branch instruction could cause both pipelines to need instructions flushed.

Issue Restrictions Limit How Often Parallelism Can Be Achieved

In practice, a superscalar machine has lots of "issue restrictions" that limit what each pipeline is capable of processing. This structural hazard limited how often both the U and V pipe of the Pentium could simultaneously execute 2 instructions. The limitations are caused by the cost of duplicating all the hardware for each pipeline, so the designers focus instead on exploiting parallelism in as many cases as practical.

Combining Superscalar with Super-Pipelining to Get the Best of Both

Another approach to superscalar is to duplicate portions of the pipeline. This becomes much easier in the new

architectures that don't require instructions to proceed at the same rate through the pipeline (or even in the original

program order). An obvious stage for exploiting superscalar design techniques is the execute stage, since PCs process

three different types of data. There are integer operations, floating-point operations and now "media" operations. We

know all about integer and floating-point. A media instruction processes graphics, sound or video data (as well as

communications data). The instruction sets now include MMX, 3DNow!, Enhanced 3DNow!, SSE, and SSE2 media

instructions. The execute stage could attempt to simultaneously process all three types of instructions, as long as there is

enough hardware to avoid structural hazards.

In practice, there are several structural hazards that require issue restrictions. Each new execution resource could also

have its own pipeline. Many floating-point instructions and media instructions require multiple clocks and aren't fully

pipelined in some implementations. We'll clear up any confusion when we analyze some real processors later. For now,

it's only important to understand the fundamentals of superscalar design and realize that modern architectures include

combinations of multiple pipelines running simultaneously.

Exploiting Data-Level Parallelism Via SIMD

We'll talk more about this later, but the new focus on media instructions has allowed CPU designers to recognize the

inherent parallelism in the way data is processed. The same operation is often performed on independent data sets, such

as multiplying data stored in a vector or a matrix. A single instruction is repeated over and over for multiple pieces of data.

We can design special hardware to do this more efficiently, and we call this a "Single Instruction Multiple Data (SIMD)"

computing model.
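
For a concrete taste of SIMD, here is a short example using Intel's SSE intrinsics, which operate on four packed single-precision floats per instruction. It assumes an SSE-capable x86 compiler; alignment tricks and error handling are left out to keep the idea visible.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] * b[i], four floats per instruction. */
void vec_mul(float *c, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_mul_ps(va, vb));
    }
    for (; i < n; i++)          /* scalar tail for leftover elements */
        c[i] = a[i] * b[i];
}
```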

More Pressure on the Memory System

Once again, take a step back and think about the implications before that person in the back of the room gets us to dive

into implementation details. With some intuitive analysis, we can observe that we've once again put tremendous pressure

on our memory subsystem. A single instruction coming down our pipeline(s) could force multiple data load and store

operations. Thinking a bit further about the nature of media processing, some of the streaming media types (like video)

have critical timing constraints, and the streams can last for a long time (i.e. as a viewer of video, you expect a continuous

flow of the video stream over time, preferably without choppiness or interruptions). Our data caches may not do us much

good, since the data may only get processed once before the next chunk of data wants to replace it (data caches are

most effective when the same data is accessed over and over). Thus the CPU architects have some new challenges to

solve.

Where Should Designers Focus The Effort?

By now, you've likely come to realize that every CPU vendor is trying to solve similar problems. They're all trying to take a

1970s instruction set and do as much parallel processing as possible, but they're forced to deal with the limitations of both

the instruction set and the nature of memory systems. There is a practical limit to how many instructions can be

processed in parallel, and it gets more and more difficult for the hardware to "dynamically" schedule instructions around

any possible instruction blockage. The compilers are getting better at "statically" scheduling, based on the limited

information available at compile time. However, the hardware is being pushed to the limits in an attempt to look as far

ahead in the instruction stream as possible in the search for non-blocking instructions.

It's All About Memory Latency

As we've shown, there are 2 stages of our computer model where the designers can get the most return on their efforts.

These are Instruction Fetch and Data Access, and both can cause an enormous performance loss if not handled properly.

The problem is caused by the fact that our pipelines are now running at over one GHz, and it can take over 100 pipeline

cycles to get something from main memory. The key to solving the problem is to make sure that the required instructions

or data aren't sitting in main memory when you need them, but instead, are already in a buffer inside your pipeline--or at

least sitting in an upper level of your cache hierarchy.
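
One standard way to quantify the cost is the average memory access time (AMAT): hit time plus miss rate times miss penalty, applied level by level down the hierarchy. A sketch with purely illustrative numbers, not measurements of any real CPU:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only. */
    double l1_hit = 2, l2_hit = 7, mem = 100;   /* access latencies in cycles */
    double l1_miss = 0.05, l2_miss = 0.20;      /* miss rates                 */

    double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem);
    printf("AMAT = %.2f cycles\n", amat);  /* small miss rates, big penalties */
    return 0;
}
```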

Branch Prediction Can Solve the Problem With I-Fetch Latency

If we could predict with 100% certainty which direction a program branch is going (forward or backward in the instruction

stream), then we could make sure that the instructions following the branch instruction are in the correct sequence in the

pipeline. That's not possible, but improvement in the branch predictor can have a dramatic performance gain for these

modern, deeply-pipelined architectures. We'll analyze some branch prediction approaches later.

Data Memory Latency is Much Tougher to Handle

One way to deal with data latency is to have "non-blocking loads" so that other memory operations can proceed while

we're waiting for the data for a specific instruction to come back from the memory system. Every x86 architecture does

this now. Still, if the data is sitting in main memory when the load is being executed, the chip's performance will take a

severe hit. The key is to pre-fetch blocks of data before they're needed, and special instructions have been added to

directly allow the software to deal with the limited locality of data.

There are also some ways that the pipeline can help by buffering up load requests and using intelligent data pre-fetching

techniques based on the processor's knowledge of the instruction stream. We'll analyze some of the vendor solutions to

the problem of data access.

A Closer Look At Branch Prediction

The person in the back of the room will be happy to hear that things are about to get more complicated. We're now going

to explore some of the recent innovations in CPU microarchitecture, starting with branch prediction. All the easy

techniques have already been implemented. To get better prediction accuracy, microprocessor designers are combining

multiple predictors and inventing clever new algorithms.

There really are three different kinds of branches:

• Forward conditional branches - based on a run-time condition, the PC (Program Counter) is changed to point to an address forward in the instruction stream.

• Backward conditional branches - the PC is changed to point backward in the instruction stream. The branch is based on some condition, such as branching backwards to the beginning of a program loop when a test at the end of the loop states the loop should be executed again.

• Unconditional branches - this includes jumps, procedure calls and returns that have no specific condition. For example, an unconditional jump instruction might be coded in assembly language as simply "jmp", and the instruction stream must immediately be directed to the target location pointed to by the jump instruction, whereas a conditional jump that might be coded as "jmpne" would redirect the instruction stream only if the result of a comparison of two values in a previous "compare" instruction shows the values to not be equal. (The segmented addressing scheme used by the x86 architecture adds extra complexity, since jumps can be either "near" (within a segment) or "far" (outside the segment). Each type has different effects on branch prediction algorithms.)

Using Branch Statistics for Static Prediction

Forward branches dominate backward branches by about 4 to 1 (whether conditional or not). About 60% of the forward

conditional branches are taken, while approximately 85% of the backward conditional branches are taken (because of the

prevalence of program loops). Just knowing this data about average code behavior, we could optimize our architecture for

the common cases. A "Static Predictor" can just look at the offset (distance forward or backward from current PC) for

conditional branches as soon as the instruction is decoded. Backward branches will be predicted to be taken, since that is

the most common case. The accuracy of the static predictor will depend on the type of code being executed, as well as

the coding style used by the programmer. These statistics were derived from the SPEC suite of benchmarks, and many

PC software workloads will favor slightly different static behavior.
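
Expressed as code, the static rule is almost trivial: backward branches (negative displacement) are predicted taken, forward branches are predicted not taken. A minimal sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Static prediction from the branch displacement alone:
 * backward (loops) -> predict taken, forward -> predict not taken. */
static bool static_predict(int32_t displacement)
{
    return displacement < 0;
}
```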

Dynamic Branch Prediction with a Branch History Buffer (BHB)

To refine our branch prediction, we could create a buffer that is indexed by the low-order address bits of recent branch

instructions. In this BHB (sometimes called a "Branch History Table (BHT)"), for each branch instruction, we'd store a bit

that indicates whether the branch was recently taken. A simple way to implement a dynamic branch predictor would be to

check the BHB for every branch instruction. If the BHB's prediction bit indicates the branch should be taken, then the

pipeline can go ahead and start fetching instructions from the new address (once it computes the target address).

By the time the branch instruction works its way down the pipeline and actually causes a branch, then the correct

instructions are already in the pipeline. If the BHB was wrong, a "misprediction" occurred, and we'll have to flush out the

incorrectly fetched instructions and invert the BHB prediction bit.

Refining Our BHB by Storing More Bits

It turns out that a single bit in the BHB will be wrong twice for a loop--once on the first pass of the loop and once at the

end of the loop. We can get better prediction accuracy by using more bits to create a "saturating counter" that is

incremented on a taken branch and decremented on an untaken branch. It turns out that a 2-bit predictor does about as

well as you could get with more bits, achieving anywhere from 82% to 99% prediction accuracy with a table of 4096

entries. This size of table is at the point of diminishing returns for 2 bit entries, so there isn't much point in storing more.

Since we're only indexing by the lower address bits, notice that 2 different branch addresses might have the same low-

order bits and could point to the same place in our table--one reason not to let the table get too small.
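
A 2-bit saturating counter per entry takes only a few lines to express. The 4096-entry table and low-order-bit indexing follow the description above; everything else is a simplified sketch, not any vendor's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHB_ENTRIES 4096

/* Counter values 0,1 = predict not taken; 2,3 = predict taken. */
static uint8_t bhb[BHB_ENTRIES];

static bool bhb_predict(uint32_t branch_pc)
{
    return bhb[branch_pc % BHB_ENTRIES] >= 2;   /* index by low-order address bits */
}

static void bhb_update(uint32_t branch_pc, bool taken)
{
    uint8_t *ctr = &bhb[branch_pc % BHB_ENTRIES];
    if (taken  && *ctr < 3) (*ctr)++;   /* saturate at 3 */
    if (!taken && *ctr > 0) (*ctr)--;   /* saturate at 0 */
}
```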

Two-Level Predictors and the GShare Algorithm

There is a further refinement we can make to our BHB by correlating the behavior of other branches. Often called a

"Global History Counter", this "two-level predictor" allows the behavior of other branches to also update the predictor bits

for a particular branch instruction and achieve slightly better overall prediction accuracy. One implementation is called the

"GShare algorithm". This approach uses a "Global Branch History Register" (a register that stores the global result of

recent branches) that gets "hashed" with bits from the address of the branch being predicted. The resulting value is used

as an index into the BHB where the prediction entry at that location is used to dynamically predict the branch direction.

Yes, this is complicated stuff, but it's being used in several modern processors.
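
The GShare index calculation itself is just an XOR of the global history register with bits of the branch address; the table entries are the same 2-bit counters as before. A sketch with an assumed 12-bit index:

```c
#include <stdbool.h>
#include <stdint.h>

#define GSHARE_BITS    12
#define GSHARE_ENTRIES (1u << GSHARE_BITS)

static uint8_t  counters[GSHARE_ENTRIES];   /* 2-bit saturating counters           */
static uint32_t global_history;             /* recent branch outcomes, one bit each */

static bool gshare_predict(uint32_t branch_pc)
{
    uint32_t index = (branch_pc ^ global_history) & (GSHARE_ENTRIES - 1);
    return counters[index] >= 2;
}

static void gshare_update(uint32_t branch_pc, bool taken)
{
    uint32_t index = (branch_pc ^ global_history) & (GSHARE_ENTRIES - 1);
    uint8_t *ctr = &counters[index];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    /* shift this outcome into the global history so other branches see it */
    global_history = ((global_history << 1) | (taken ? 1u : 0u)) & (GSHARE_ENTRIES - 1);
}
```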

Using a Branch Target Buffer (BTB) to Further Reduce the Branch Penalty

In addition to a large BHB, most predictors also include a buffer that stores the actual target address of taken branches

(along with optional prediction bits). This table allows the CPU to look to see if an instruction is a branch and start fetching

at the target address early on in the pipeline processing. By storing the instruction address and the target address, even

before the processor decodes the instruction, it can know that it is a branch. The figure below shows an implementation of

a BTB. A large BTB can completely remove most branch penalties (for correctly-predicted branches) if the CPU looks far

enough ahead to make sure the target instructions are pre-fetched.

Using a Return Address Buffer to Predict the Return From a Subroutine

One technique for dealing with the unconditional branch at the end of a subroutine is to create a buffer of the most recent return addresses. There are usually some subroutines that get called quite often in a program, and a return address buffer can make sure that the correct instructions are in the pipeline after the return instruction.
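
A return address buffer behaves like a small hardware stack: a call pushes the address of the instruction after it, and a return pops that address as its prediction. A sketch (the depth and the wrap-around policy are illustrative assumptions):

```c
#include <stdint.h>

#define RAS_DEPTH 16

static uint32_t ras[RAS_DEPTH];
static int ras_top;

/* On a predicted call, remember where execution should resume afterward. */
static void ras_push(uint32_t return_address)
{
    ras[ras_top] = return_address;
    ras_top = (ras_top + 1) % RAS_DEPTH;   /* wrap: deep call chains overwrite the oldest */
}

/* On a predicted return, fetch from the most recent call site. */
static uint32_t ras_pop(void)
{
    ras_top = (ras_top - 1 + RAS_DEPTH) % RAS_DEPTH;
    return ras[ras_top];
}
```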

Speculative, Out-of-Order Execution Gets a New Name

While RISC chips used the same terms as the rest of the computer engineering community, the Intel marketing

department decided that the average consumer wouldn't like the idea of a computer that "speculates" or runs programs

"out of order". A nice warm-and-fuzzy term was coined for the P6 architecture, and "Dynamic Execution" was added to

our list of non-descriptive buzzwords.

Both AMD and Intel use a microarchitecture that, after decoding into simpler RISC instructions, tosses the instructions

into a big hopper and allows them to execute in whatever order best matches the available compute resources. Once the

instructions have finished executing out of order, results get "committed" in the original program order. The term

"speculation" refers to instructions being speculatively fetched, decoded and executed.

A useful analogy can be drawn to the stock market investor who "speculates" that a stock will go up in value and justify an

investment. For a microprocessor speculating on instructions in advance, if the speculation turns out to be incorrect, those

instructions are eliminated before any machine state changes are committed (written to processor registers or memory).

Once Again, Let's Take a Step Back and Try Some More Intuitive Analysis

By now that person in the back of the room has finally gotten used to these short pauses to look at the big picture. In this

case, we just made a huge change to our machine, and it's hard to easily conceptualize. We've completely scrambled the

notion of how instructions flow down a one-way pipeline. One thing that becomes obvious is the need for darn good

branch prediction. All that speculation becomes wasted memory bandwidth, execution time, and power if we end up

taking a branch we didn't expect. Following our stock investor analogy, if the value doesn't go up, then the investment

was wasted and could have been more productively used elsewhere. In fact, the speculation could make us worse off.

The need to wait before committing completed instructions to registers or memory

should probably be obvious, since we could end up with incorrect program behavior

and incorrect data--then have to try to unwind everything when a branch misprediction

(or an exception) comes along. The real power of this approach would seem to be

realized by having lots of superscalar stages, since we can reorder the instructions to

better match the issue restrictions of multiple compute resources. OK, enough

speculation, let's dig into the details:

Register Renaming Creates Virtual Registers

If you're going to have speculative instructions operating out of order, then you can't have them all trying to change the

same registers. You need to create a "register alias table (RAT)" that renames and maps the eight x86 registers to a

much larger set of temporary internal register storage locations, permitting multiple instances of any of the original eight

registers. An instruction will load and store values using these temporary registers, while the RAT keeps track of what the

latest known values are for the actual x86 registers. Once the instructions are completed and re-ordered so that we know

the register state is correct, then the temporary registers are committed back to the real x86 registers.
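
The RAT can be pictured as a small array mapping the eight architectural x86 registers onto a larger pool of physical registers. This sketch shows only the rename step; free-list management and recovery after a misprediction are omitted, and all sizes and names are illustrative.

```c
#include <stdint.h>

#define ARCH_REGS 8     /* the eight x86 registers, EAX..EDI       */
#define PHYS_REGS 64    /* illustrative pool of internal registers */

static uint8_t rat[ARCH_REGS];   /* arch register -> latest physical register */
static uint8_t next_free;        /* stand-in for a real free list             */

/* Rename one instruction: sources read the current mapping, and the
 * destination gets a brand-new physical register so older in-flight
 * instructions keep using the previous instance undisturbed.        */
static void rename_instruction(uint8_t src1, uint8_t src2, uint8_t dest,
                               uint8_t *psrc1, uint8_t *psrc2, uint8_t *pdest)
{
    *psrc1 = rat[src1];
    *psrc2 = rat[src2];
    *pdest = next_free;
    next_free = (uint8_t)((next_free + 1) % PHYS_REGS);
    rat[dest] = *pdest;          /* later readers now see the new instance */
}
```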

The Reorder Buffer (ROB) Helps Keep Instructions in Order

After an instruction is decoded, it's allowed to execute out of order as soon as the operands (data) become available. A

special Reorder Buffer is created to keep track of instruction status, such as when the operands become available for

execution, or when the instruction has completed execution and results can be "committed" or "retired" to architectural

registers or memory in the original program order. These instructions use the renamed register set and are "dispatched"

to the execution units as resources become available, perhaps spending some time in "reservation stations" that operate

as instruction queues at the front of various execution units. After an instruction has finished executing, it can be "retired"

by the ROB. However, the state still isn't committed until all the older instructions (with respect to program order) have

been retired first.

A neat thing about using register renaming, reservation stations, and the ROB is that a result from a completed instruction

can be forwarded directly to the renamed register of a new instruction. Many potential data dependencies go away

completely, and the pipelines are kept moving.
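
In-order retirement is essentially a circular buffer scanned from its head: an entry commits only when it has finished executing and everything older has already committed. A sketch (the 126-entry size echoes the Pentium 4 figure quoted later in this article; any size works for the idea):

```c
#include <stdbool.h>

#define ROB_SIZE 126

typedef struct {
    bool valid;          /* slot holds a live instruction                  */
    bool done;           /* finished executing (possibly out of order)     */
} rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static int rob_head;     /* oldest instruction in program order            */

/* Commit up to 'width' instructions per cycle, strictly in program order. */
static void rob_retire(int width)
{
    for (int i = 0; i < width; i++) {
        rob_entry_t *e = &rob[rob_head];
        if (!e->valid || !e->done)
            break;       /* oldest not ready: nothing younger may commit yet */
        /* ...write results to architectural registers / memory here...      */
        e->valid = false;
        rob_head = (rob_head + 1) % ROB_SIZE;
    }
}
```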

Load and Store Buffering Tries to Hide Data Access Latency

In the same way that instructions are executed as soon as resources become available, a load or a store instruction can

get an early start by using this speculative approach. Obviously, the stores can't actually get sent all the way to memory

until we're sure the data really should be changed (requiring we maintain program order). Instead, the stores are buffered,

retired, and committed in order. The loads are a more interesting case, since they are directly affected by memory

latency, the other key problem we highlighted earlier. The hardware will speculatively execute the load instruction,

calculating the Effective Address out of order. Depending on the implementation, it may even allow out-of-order cache

access, as long as the loads don't access the same address as a previous store instruction still in the processor pipeline,

but not yet committed. If in fact the load instruction needs the results of a previous store that has completed but is still in

the machine, the store data can get forwarded directly to the load instruction (saving the memory load time).

Analyzing Some Real Microprocessors: P4

We've come to the end of our tutorial on processor microarchitecture. Hopefully, we've given you enough analytical tools

so that you're now ready to dig into the details of some real products. There are a few common microarchitectural

features (like instruction translation) that we decided would be easier to explain as we show some real implementations.

We'll also look a bit deeper at the arcane science of branch prediction. Let's now take an objective look at the Intel P4,

AMD Athlon, and VIA/Centaur C3. We'll then do some more big-picture analysis and gaze forward to predict the future of

PC microarchitecture.

Intel Pentium 4 Microarchitecture

Intel is vigorously promoting the Pentium 4 as the preferred desktop processor, so we'll focus our Intel analysis on this

microarchitecture. We'll make a few comparisons to previous processor generations, but our goal is to gain a detailed

understanding of how the Pentium 4 meets its design goals. We'll leave it as an "exercise for the reader" to apply your

new analytical tools to the Pentium III. The Pentium 4 is the first x86 chip to use some newer microarchitectural

innovations, offering us an opportunity to explore some of these new approaches to dealing with the 2 key latency-

induced challenges in CPU design.

We should point out that our analysis only covers the "Willamette" version of the P4, while the forthcoming "Northwood"

will move to a .13 micron process geometry and make slight changes to the microarchitecture (most likely improving the

memory subsystem). We'll update this article when we get more information on Northwood.

The NetBurst™ Moniker Describes a Collection of Design Features

What's the point of introducing a new product without adding a new Intel buzzword? In this case, the name doesn't refer

to a single architectural improvement, but is really meant to serve as a name for this family of microprocessors. The

NetBurst design changes include a deeper pipeline, new bus architecture, more execution resources, and changes to the

memory subsystem. The figure below shows a block diagram of the Pentium 4, and we'll take a look at each major

section.

Deeply Pipelined for Higher Clock Rate

The Pentium 4 has a whopping 20-stage pipeline when processing a branch misprediction. The figure below shows how

this pipeline compares to the 10 stages of the Pentium III. The most interesting thing about the Pentium 4 pipe is that Intel

has dedicated 2 stages for driving data across the chip. This is fascinating proof that the limiting factor in modern IC

design has become the time it takes to transmit a signal across the wire connections on the chip. To understand why it's

fascinating, consider that it wasn't so long ago that designers only worried about the speed of transistors, and the time it

took to traverse such a short piece of metal was considered essentially instantaneous. Now we're moving from aluminum to copper interconnect, largely because copper's lower resistance lets signals propagate across the chip faster. (I can see that person in the back of the room is still with

us and is nodding in agreement.) This is fascinating stuff, and Intel is probably the first vendor to design a pipeline with

"Drive" stages.

What About All Those Problems with Long Pipelines?

Well, Intel has to work especially hard to make sure they avoid pipeline hazards. If that long pipeline needs to be flushed

very often, then the performance will be much lower than other designs. We should remind ourselves that the longer

pipeline actually results in less work being done on each clock cycle. That's the whole point of super-pipelining (or hyper-

pipelining, if you prefer), since doing less work in a clock cycle is what allows the clock cycle time to be shortened. The

pipeline has to run at a higher frequency just to do the same amount of work as a shorter pipeline. All other things being

equal, you'd expect the Pentium 4 to have less performance than parts with shorter pipelines at the same frequency.

Searching for Even More Instruction-Level Parallelism

As we learned, there is another thing to realize about long pipelines (besides being able to run at the high clock rates that

motivate uninformed buyers). Longer pipelines allow more instructions to be in process at the same time. The compiler

(static scheduler) and the hardware (dynamic scheduler) must keep the faster and deeper pipeline fed with the

instructions and data it needs during a larger instruction "window". The machine is going to have to search even further to

find instructions that can execute in parallel. As you'll see, the Pentium 4 can have an incredible 126 instructions in-flight

as it searches further and further ahead in the instruction stream for something to work on while waiting for data or

resource dependencies to clear.

Pentium 4's Cache Organization

Cache Organization in the Memory Hierarchy

As we described in our article on motherboard technology, there is usually a trade-off between cache size and speed.

This is mostly because of the extra capacitive loading on the signals that drive the larger SRAM arrays. Refer again to

block diagram of the Pentium 4. Intel has chosen to keep the L1 caches rather small so that they can reduce the latency

of cache accesses. Even a data cache hit will take 2 cycles to complete (6 cycles for floating-point data). We'll talk about

the L1 caches in a moment, but further down the hierarchy we find that the L2 cache is an 8-way, unified (includes both

instruction and data), 256KB cache with a 128B line size.

The 8-way structure means each cache index maps to a set of 8 lines (8 tags checked in parallel), providing about the same cache miss rate as a "fully-associative" cache

(as good as it gets). This makes the 256KB cache more effective than its size indicates, since the miss rate of this cache

is approximately 60% of the miss rate for a direct-mapped (1-way) cache of the same size.

The downside is that an 8-way cache will be slower to access. Intel states that the load latency is 7 cycles (this reflects the time it takes an L2 cache line to be fully retrieved to either the L1 data cache or the x86 instruction prefetch/decode buffers), but the cache is able to transfer new data every 2 cycles (which is the effective throughput assuming multiple concurrent cache transfers are initiated). Again, notice that the L2 cache is shared between instruction fetches and data accesses (unified).

System Bus Architecture is Matched to Memory Hierarchy Organization

One interesting change for the L2 cache is to make the line size 128 bytes, instead of

the familiar 32 bytes. The larger line size can slightly improve the hit rate (in some cases), but requires a longer latency

for cache line refills from the system bus. This is where the new Pentium 4 bus comes into play. Using a 100MHz clock

and transferring data four times on each bus clock (which Intel calls a 400MHz data rate), the 64-bit system bus can bring

in 32 bytes each cycle. This translates to a bandwidth of 3.2 GB/sec.

To fill an L2 cache line requires four bus cycles (the same number of cycles as the P6 bus for a 32-byte line). Note that

the system bus protocol has a 64-byte access length (matching the line size of the L1 cache) and requires 2 main

memory request operations to fill an L2 cache line. However, the faster bus only helps overcome the latency of getting the

extra data into the CPU from the North Bridge. The longer line size still causes a longer latency before getting all the burst

data from main memory. In fact, some analysts note that P4 systems have about 19% more memory latency than

Pentium III systems (measured in nanoseconds for the demand word of a cache refill). Smart pre-fetching is critical or

else the P4 will end up with less performance on many applications.
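
The numbers quoted above follow from simple arithmetic on the figures already given (100MHz bus clock, four transfers per clock, 64-bit width, 128-byte L2 line); nothing in this sketch goes beyond that.

```c
#include <stdio.h>

int main(void)
{
    const double bus_clock_mhz   = 100.0;  /* quad-pumped "400MHz" data rate */
    const int    transfers       = 4;      /* transfers per bus clock        */
    const int    bus_width_bytes = 8;      /* 64-bit bus                     */
    const int    l2_line_bytes   = 128;

    double bytes_per_clock = transfers * bus_width_bytes;                  /* 32  */
    double bandwidth_gbs   = bytes_per_clock * bus_clock_mhz * 1e6 / 1e9;  /* 3.2 */
    double clocks_per_line = l2_line_bytes / bytes_per_clock;              /* 4   */

    printf("%.1f GB/s, %.0f bus clocks per 128-byte L2 line fill\n",
           bandwidth_gbs, clocks_per_line);
    return 0;
}
```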

Pre-Fetching Hardware Can Help if Data Accesses Follow a Regular Pattern

The L2 cache has pre-fetch hardware to request the next 2 cache lines (256 bytes) beyond the current access location.

This pre-fetch logic has some intelligence to allow it to monitor the history of cache misses and try to avoid unnecessary

pre-fetches (that waste bandwidth and cache space). We'll talk more about the pre-fetcher later, but let's take a quick

pause for some of our patented intuitive analysis. We've described the problem of dealing with streaming media types

(like video) that don't spend much time in the cache. The hardware pre-fetch logic should easily notice the pattern of

cache misses and then pre-load data, leading to much better performance on these types of applications.

Designing for Data Cache Hits

Intel boasts of "new algorithms" to allow faster access to the 8KB, four-way, L1 data cache. They are most likely referring

to the fact that the Pentium 4 speculatively processes load instructions as if they always hit in the L1 data cache (and data

TLB). By optimizing for this case, there aren't any extra cycles burned while cache tags are checked for a miss. The load

instruction is sent on its merry way down the pipeline; if a cache miss delays the load, the processor passes temporarily

incorrect data to dependent instructions that assumed the data arrived in 2 cycles. Once the hardware discovers the L1

data cache miss and brings in the actual data from the rest of the memory hierarchy, the machine must "replay" any

instructions that had data dependencies and grabbed the wrong data.

It's unclear how efficient this approach will be, since it obviously depends on the load pattern for the applications. The

worst case would be an application that constantly loads data that is scattered around memory, while attempting to

immediately perform an operation on each new data value. The hardware pre-fetch logic would (perhaps mercifully) never

"trigger", and the pipeline would be constantly restarting instructions.

Again, the Pentium 4 design seems to have been optimized for the case of streaming media (just as Intel claims), since

these algorithms are much more regular and demand high performance. The designers probably hope that the

pathological worst case only occurs for code that doesn't need high performance. When the L1 data cache does have a

miss, it has a "fat pipe" (32 bytes wide) to the L2 cache, allowing each 64-byte cache line to be refilled in 2 clocks.

However, there is a 7-cycle latency before the L2 data starts arriving, as we mentioned previously. The Pentium 4 can

have up to four L1 data cache misses in process.

Pentium 4's Trace Cache

The Trace Cache Depends on Good Branch Prediction

Instead of a classic L1 instruction cache, the Pentium 4 designers felt confident enough in their branch prediction

algorithms to implement a trace cache. Rather than storing standard x86 instructions, the trace cache stores the

instructions after they've already been decoded into RISC-style instructions. Intel calls them "µops" (micro-ops) and

stores 6 µops for each "trace line". The trace cache can house up to 12K µops. Since the instructions have already been

decoded, hardware knows about any branches and fetches instructions that follow the branch. As we learned, it's the

conditional branches that could really cause a problem, since we won't know if we're wrong until the branch condition

check in Arithmetic Logic Unit 0 (ALU0) of the execution core. By then, our trace cache could have pre-fetched and

decoded a lot of instructions we don't need. The pipeline could also allow several out-of-order instructions to proceed if

the branch instruction was forced to wait for ALU0.

Hopefully, the alternative branch address is somewhere in the trace cache. Otherwise, we'll have to pay those 7 cycles of

latency to get the proper instructions from the L2 cache (pity us if it's not there either, as the L2 cache would need to get

the instructions from main memory) plus the time to decode the fetched x86 instructions. Intel's reference to the 20-stage

P4 pipeline actually starts with the trace cache, and does not include the cycles for instruction or data fetches from

system memory or L2 cache.

The Trace Cache has Several Advantages

If our predictors work well, then the trace cache is able to provide (the correct) three µops per cycle to the execution

scheduler. Since the trace cache is (hopefully) only storing instructions that actually get executed, then it makes more

efficient use of the limited cache space. Since the branch target instruction has already been decoded and fetched in

execution order, there isn't any extra latency for branches. The person in the back of the room just reminded us of an

interesting point. We never mentioned a TLB check for the trace cache, because it does not use one. So, the Pentium 4

isn't so complicated after all. Most of you correctly observed that this cache uses virtual addressing, so there isn't any

need to convert to physical addresses until we access the L2 cache. Intel's documents don't give the size of the instruction

TLB used for those L2 accesses.
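
To make the idea concrete, here is a toy, hypothetical sketch of a virtually-addressed trace cache; the class structure, the lack of eviction, and the decode_fn callback are our own simplifications, not Intel's design.

```python
UOPS_PER_TRACE_LINE = 6          # as described above
CAPACITY_UOPS = 12 * 1024        # roughly 12K uops total

class TraceCache:
    """Toy trace cache: maps a virtual start address to already-decoded uops."""
    def __init__(self):
        self.lines = {}              # virtual address -> list of uops
        self.used = 0

    def fetch(self, virtual_addr, decode_fn):
        if virtual_addr in self.lines:          # hit: skip the x86 decoder entirely
            return self.lines[virtual_addr]
        uops = decode_fn(virtual_addr)[:UOPS_PER_TRACE_LINE]   # miss: decode from L2/memory
        if self.used + len(uops) <= CAPACITY_UOPS:             # no eviction in this toy model
            self.lines[virtual_addr] = uops
            self.used += len(uops)
        return uops

# Example: a fake decoder that returns three uops for any address.
tc = TraceCache()
print(tc.fetch(0x8048000, lambda addr: ["load", "add", "store"]))
```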

Pentium 4 Decoder Relies on Trace Cache to Buffer µops

The Pentium 4 decoder can only convert a single x86 instruction on each clock, fewer than competing x86 designs can manage. However,

since the µops are cached in the trace buffer (and hopefully reused), the decode bandwidth is probably adequate to

match the instruction issue rate (three µops/cycle). If an x86 instruction requires more than four µops, then the decoder

fetches µops directly from a µop "Read-Only Memory (ROM)". All x86 processor architectures use some sort of ROM for

infrequently used instructions or multi-cycle string operations.

The Execution Engine Runs Out Of Order

For an out-of-order machine, the main design goal is to provide enough parallel compute resources to make it worth all

the extra complexity. In this case, the machine is working to schedule instructions for 7 different parallel units, shown in

the figure below. Two of these units dispatch loads and stores (the Data Access stage of our original computer model).

The other processing tasks use multiple schedulers and are dispatched through the 2 Exec Ports. Each port could have a

fast ALU operation scheduled every half cycle, though other µops get scheduled every cycle. The figure below shows

what each port can dispatch.

Notice the numerous issue restrictions (structural hazards). If you were to have just fast ALU µops on both Exec Ports

and a simultaneous Load and Store dispatch, then a total of 6 µops/cycle (four double-speed ALU instructions, a Load,

and a Store) can be dispatched to execution units. The performance of the execution engine will depend on the type of

program and how well the schedulers can align µops to match the execution resources.
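
The best-case counting from the paragraph above works out as follows; this is an illustration of the arithmetic only, and the variable names are ours rather than Intel's.

```python
exec_ports = 2               # each can accept a "fast ALU" uop every half cycle
fast_alu_per_port = 2        # double-speed ALUs -> two uops per port per full cycle
load_ports, store_ports = 1, 1

best_case_dispatch = exec_ports * fast_alu_per_port + load_ports + store_ports
print(best_case_dispatch)    # 6 uops/cycle in the best case described above
```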

Retiring Instructions in Order and Updating the Branch Predictors

The Reorder Buffer can retire three µops/cycle, matching the instruction issue rate. There are some subtle differences in

the way the Pentium 4 ROB and register renaming are implemented compared to other processors like the Pentium III,

but the operation is very similar. As we've shown, a key to performance is to avoid mispredicted branches. As instructions

are retired from the ROB, the final branch addresses are used to update the Branch Target Buffer and Branch History

Buffer.

In case some of you have finally figured out modern branch predictors, Intel has chosen to rename the combination of a

BTB and a BHB. Intel calls the combination a "Branch Target Buffer (BTB)", ensuring extra confusion for our new students

of computer microarchitecture.

Branch Prediction Uses a Combination of Schemes

While there isn't much public information about how the Pentium 4 does branch prediction, Intel likely uses a two-level

predictor and combine information from the Static Prediction we discussed earlier. They also include a Return Address

Buffer of some undisclosed size. The specific algorithms are part of the "secret sauce" that processor vendors guard

closely. In the past, we've seen various patent filings describing algorithmic mechanisms used in branch predictors and

other processor subsystems. The patent details shed more light on their implementations than processor vendors would

otherwise choose to disclose publicly.

Branch Hints Can Allow Faster Performance on a Known Data Set

The Pentium 4 also allows software-directed branch hints to be passed as prefixes to branch instructions. These branch

hints allow the software to override the Static Predictor and can be a powerful tool. This is particularly true if the program

is compiled and executed with special features enabled to collect information about program flow. The information from

the prior run can be fed back to the compiler to create a new executable with Branch Hints that avoid the earlier

mispredictions.

There is some potential for marketing abuse of this feature, since benchmarks that use a repeatable data set can be

optimized to avoid performance-killing branch mispredictions.

Support for New Media Instructions

The Pentium 4 has retained the earlier x86 instruction extensions (MMX and SSE) and added 144 new instructions they

call SSE2. It will be the task for another article to give a complete analysis and comparison of the x86 instruction

extensions and execution resources. However, as we've noted several times, the Pentium 4 is tuned for performance on

streaming media applications.

Poor Thermal Management Can Limit Performance

One potentially troubling feature of the Pentium 4 is the "Thermal Monitor" that can be enabled to slow the internal clock

rate to half speed (or less, depending on the setting) when the die temperature exceeds a certain value. On a 1.5 GHz

Pentium 4 (Willamette), this trip temperature currently corresponds to about 54.7 Watts of power dissipation (according to Intel's Thermal Design

Guide and P4 datasheet). This is almost certainly a limitation of the package and heat sink, but the maximum power

dissipation of a 1.5 GHz part is currently about 73 Watts.

Intel would argue that this maximum would never be reached, but it is quite possible that demanding applications will

cause a poorly-cooled CPU to exceed the current thermal cut-off point - losing performance at a time when you need it

the most. As Intel moves to lower voltages in a more advanced manufacturing process, these limits will be less of a

problem - at current clock rates. As higher clock rate parts are introduced, the potential performance loss will again be an

issue.

Certainly, the Thermal Monitor is a good feature for ensuring that parts don't destroy themselves. It also is a clever

solution to the problem of turning on fans quickly enough to match the high thermal ramp rates. The concerns may only

arise for low-cost, inadequate heatsinks and fans. Customers may appreciate the system stability this feature offers, but

not the uncertainty about whether they're getting all the performance they paid for. We've heard from one of Intel's

competitors that certain Dell and HP Pentium 4 systems they tested do not enable this clock slow-down feature. This is

actually a good thing if Dell and HP are confident about their thermal solution. We plan to write a separate report on our

testing of this feature soon.

Overall Conclusions About the Pentium 4

The large number of complex new features in this processor has required a lot of explanation. Clearly, this is a design that

is intended to scale to dramatically higher clock rates. Only at higher clock rates does the benefit of the microarchitecture

become realized. It is also likely that the designers were forced to make painful trade-offs in the sizes for the on-chip

memory hierarchy. With a microarchitecture so sensitive to cache misses, it will be critical to increase the size of these

memories as transistor budgets increase. With good thermal management, higher clock rates and bigger caches, this chip

should compete well in desktop systems in the future, while doing very well today with streaming media, memory

bandwidth-intensive applications, and functions that use SSE2 instructions.

AMD Athlon Microarchitecture

The Athlon architecture is more similar to our earlier analysis of speculative, out-of-order machines. This similarity is

partly due to the (comforting) maturity of the architecture, but it should be noted that the original design of the Athlon

microarchitecture emphasized performance above other factors. That more aggressive initial design has kept the

architecture competitive while minor optimizations are made for clock speed and die cost.

AMD will soon ship a new version of Athlon, code-named "Palomino" and possibly sporting bigger caches and subtle

changes to the microarchitecture. For this article, we examine "Thunderbird", the design introduced in June 2000.

Parallel Compute Resources Benefit From Out-of-Order Approach

The extra complexity of creating an out-of-order machine is wasted if there aren't parallel compute resources available for

taking advantage of those exposed instructions. Here is where Athlon really shines. The microarchitecture can execute 9

simultaneous RISC instructions (what AMD calls "OPs").

The figure below shows the block diagram of Athlon. Note the extra resources for standard floating-point Ops, likely

explaining why this processor does so well on FP-intensive programs. (Well, that person in the back of the room is still

with us.) Yes, indeed the comparative analysis gets more complex if we include the P4's SSE2 instructions for SIMD

floating-point, but we'll have to leave that analysis for another article. The current Athlon architecture will certainly have

higher performance for applications that don't have high data-level parallelism.

Cache Architecture Emphasizes Size to Achieve High Hit Rate

Note that AMD has chosen to implement large L1 caches. The L1 instruction and data caches are each 2-way, 64KB

caches. The L1 instruction cache has a line-size of 64 bytes with a 64-byte sequential pre-fetch. The L1 data cache

provides a second data port to avoid structural hazards caused by the superscalar design. The L2 cache is a 16-way,

256KB unified cache, backed up by the fast EV6 bus we discussed in the motherboard article.
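
For readers who like to check the geometry, the standard set-count formula applies; note that the 64-byte L2 line size below is our assumption, since the text only gives the L1 instruction line size.

```python
def cache_sets(total_bytes, ways, line_bytes):
    """Number of sets = capacity / (associativity * line size)."""
    return total_bytes // (ways * line_bytes)

print(cache_sets(64 * 1024, 2, 64))     # Athlon L1 (instruction or data): 512 sets
print(cache_sets(256 * 1024, 16, 64))   # Athlon L2: 256 sets, assuming 64-byte lines
```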

If we take a step back and think about differences between P4 and Athlon memory hierarchies, we can make a few

observations. Intel's documentation states that their 12K trace cache will have the same hit rate as an "8K to 16K byte

conventional instruction cache". By that measure, the Athlon will have much better hit rates, though hits will have longer

latency for decoding instructions. An L1 miss is much worse for the P4's longer pipeline, though smart pre-fetching can

overcome this limitation. Remember, at these high clock rates, it doesn't take long to drain an instruction cache. It will

eventually come down to the accuracy of the branch predictor, but the Pentium 4 will still need a bigger trace cache to

match Athlon instruction fetch effectiveness.

Pre-Decoding Uses Extra Cache Bits

To deal with the complexities of the x86 instruction set, AMD does some early decoding of x86 instructions as they are

fetched into the L1 instruction cache. These extra bits help mark the beginning and end of the variable-length instructions,

as well as identify branches for the pre-fetcher (and predictor). These extra bits and early (partial) decoding give some of

the benefits of a trace cache, though there is still latency for the completion of the decoding.
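
Here is a hypothetical sketch of the pre-decode idea, assuming a made-up length_of() helper; real x86 length decoding is far messier, but the point is that the instruction-boundary bits are computed once when the line is filled rather than on every fetch.

```python
def predecode(code_bytes, length_of):
    """Mark the first byte of each variable-length instruction with a start bit."""
    start_bits = [0] * len(code_bytes)
    i = 0
    while i < len(code_bytes):
        start_bits[i] = 1                  # this byte begins an instruction
        i += length_of(code_bytes, i)      # skip to the next instruction
    return start_bits

# Toy length function: pretend the opcode's low two bits encode a 1-3 byte length.
toy_length = lambda b, i: (b[i] & 0x3) % 3 + 1
print(predecode(bytes([0x05, 0x90, 0x31, 0xC0, 0x0F]), toy_length))
```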

Final Decoding Follows 2 Different Paths

Figure 9 shows the decode pipeline for the Athlon. Notice that it matches the flow of our original computer model,

breaking up Instruction Access and Decode stages into 6 pipeline stages. AMD uses a fixed-length instruction format

called a "MacroOp", containing one or more Ops. The instruction scheduler will turn MacroOp's into Op's as it dispatches

to the execution units. The "DirectPath Decoder" generates MacroOp's that take one or two Ops. The "VectorPath

Decoder" fetches longer instructions from ROM. Notice in the figure below that the Athlon can supply three

MacroOp's/cycle to the instruction decoder (the IDEC stage), and later they'll enter the instruction scheduler, equating to a

maximum of 6 Ops/cycle decode bandwidth. Note that the actual decode performance depends on the type of

instructions.

AMD Athlon Scheduler, Data Access

Integer Scheduler Dispatches Ops to 6 Execution Units

The figure below shows how pipeline stage 7 buffers up to 18 MacroOps that are dispatched as Ops to the integer

execution units. This buffer (a reservation station) is where instructions wait for operands (including data from memory) to become

available before executing out of order. As you'll recall, there is a Reorder Buffer that keeps track of instruction status,

operands, and results, ensuring the instructions are retired and committed in program order. Note that Integer Multiply

instructions require more compute resources and force extra issue restrictions.

Data Access Forces Instructions to Wait

Even for an out-of-order machine, our original computer model still holds up well. Notice in the figure below that loads and

stores will use the "Address Generation Units (AGUs)" to calculate the Effective Address (cycle 9 ADDGEN stage) and

access the data cache (cycle 10 DC ACC). In cycle 11, the data cache sends back a hit/miss response (and potentially

the data). If another instruction is waiting in the scheduler for this data, the data is forwarded. Cache misses will cause the

instructions to wait. There is a separate 44-entry Load/Store Unit (LSU) that manages these instructions.
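
The effective-address arithmetic the AGUs perform in hardware is the standard x86 base + index*scale + displacement form; here is a one-line sketch with values chosen purely for illustration.

```python
def effective_address(base, index, scale, displacement):
    """x86 addressing: EA = base + index * scale + displacement (scale is 1, 2, 4, or 8)."""
    assert scale in (1, 2, 4, 8)
    return base + index * scale + displacement

print(hex(effective_address(base=0x1000, index=3, scale=4, displacement=8)))  # 0x1014
```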

Floating Point Instructions Have Their Own Scheduler and Pipeline

The Athlon can simultaneously process three types of floating-point instructions (FADD, FMUL, and FSTORE), as shown

in the figure below. The floating-point units are "fully pipelined", so that new FP instructions can start while other

instructions haven't yet completed. MMX/3DNow! instructions can be executed in the FADD and FMUL pipelines. The FP

instructions execute out of order, and each of the three pipelines has several different execution units. There are some

issue restrictions that apply to these pipelines. The performance of the Athlon's fully-pipelined FP units allows it to

consistently outperform the Pentium III at similar clock speeds, and a 1.33GHz Athlon even performs better than a

1.5GHz Pentium 4 in some FP benchmarks. We haven't seen enough SSE2-optimized applications to draw a definitive

conclusion about workloads that may benefit from SSE2, however.

Branch Prediction Logic is a Combination of the Latest Methods

There is a 2048-entry Branch Target Buffer that caches the predicted target address. This works in concert with a Global

History Table that uses a "bimodal counter" to predict whether branches are taken. If the prediction is correct, then there

is a single-cycle delay to change the instruction fetcher to the new address. (Note that the P4 trace cache doesn't have

any predicted-branch-taken delays). If the predictor is wrong, then the minimum delay is 10 cycles. There is also a 12-

entry Return Address Buffer.
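
Here is a minimal sketch of the 2-bit "bimodal" saturating counter the text refers to; the table size matches the 2048-entry figure above, but the indexing and the coupling to the BTB and global history are simplified away.

```python
class BimodalPredictor:
    """2-bit saturating counters: values 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self, entries=2048):
        self.counters = [2] * entries          # start in "weakly taken"
        self.mask = entries - 1

    def predict(self, branch_addr):
        return self.counters[branch_addr & self.mask] >= 2

    def update(self, branch_addr, taken):
        i = branch_addr & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = BimodalPredictor()
bp.update(0x400, taken=False); bp.update(0x400, taken=False)
print(bp.predict(0x400))   # False: the counter has saturated toward not-taken
```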

Overall Conclusions About the Athlon Microarchitecture

To prevent this article from becoming interminably long, we have to gloss over many features of the Athlon architecture,

and undoubtedly several features will change as new versions are introduced. The main conclusion is that Athlon is a

more traditional, speculative, out-of-order machine and requires fewer pipeline stages than the Pentium 4. At the same

clock rate, Athlon should perform better than Pentium 4 on many of today's mainstream applications. The actual

comparison ratio would depend on how well the P4's SSE2 instructions are being used, how well the P4's branch

predictors and pre-fetchers are working, and how well the system/memory bus is being utilized. Memory bandwidth-

intensive applications favor the P4 today. There is a lot of room for optimizing code to match the microarchitecture, and

both AMD and Intel are working with software developers to tune the applications. We look forward to seeing what

enhancements AMD delivers with Palomino.

Centaur C3 Microarchitecture

Even though VIA/Centaur doesn't have the same market share as Intel and AMD, they have an experienced design team

and some interesting architectural innovations. This architecture also makes a nice contrast with the Intel and AMD

approaches, since Centaur has been able to stay with an in-order pipeline and still achieve good performance. The

Centaur chips use the same P6 system bus and Socket 370 motherboards.

A great cost advantage for the C3 is its diminutive size: only 52 mm² in its 0.18-micron process. This compares to 120 mm²

for Athlon and 217 mm² for P4. Also, the fastest C3 today at 800MHz consumes a very modest 17.4 watts max at 1.9V,

with typical power measured at 10.4 watts. This is much more energy-efficient than Athlon and P4.

Improving the Memory Subsystem to Solve the Key Problems

There are some philosophical differences of opinion on how best to spend the limited transistor budget, especially for

architectures specifically designed for lower cost and power. Intel and AMD are battling for the high-end where the fastest

CPUs command a price premium. They can tolerate the expense of larger die sizes and more thermally-effective

packages and heat sinks. However, when the goal of maximum performance drops to a number 2 or 3 slot behind power

and cost, then different design choices are made.

Up until now, Intel and AMD have made slight modifications to their high-performance architectures to address these

other markets. As the markets bifurcate further, AMD and Intel may introduce parts with microarchitectures that are more

optimized for power and cost.

Centaur Uses Cache Design to Directly Deal with Latency

VIA (Centaur) has made early design choices to target the low-cost markets. Centaur has stressed the value of optimizing

the memory subsystem to solve the key problems of memory latency. If you're constraining your die size to reduce cost,

then many processor designers feel it's often a better trade-off to use those transistors in the memory subsystem.

Centaur's chip architects believe that their large L1 caches (four-way, 64KB each) give them a better performance return

than if they had used the die area (and design time) to more aggressively reschedule instructions in the pipeline. If latency

is the key problem, then clever cache design is a direct way to address it. The figure below shows the block diagram of

the Centaur processor. The Cyrix name has recently been dropped, and this product is marketed as the "VIA C3"

(internally referred to as C5B).

Decoupling the Pipeline to Reduce Instruction Blockage

Even with a pipeline that processes instructions in-order, it is possible to solve many of the key design problems by

allowing the different pipeline stages to process groups of instructions. At various stages of the pipeline, instructions are

queued up while waiting for resources to become available. In this "decoupled architecture", an in-order machine like the

Centaur C3 processor can match the performance of the out-of-order approach we've described, as long as no

instructions block the pipeline. If a block occurs at a later stage of the pipeline, the in-order machine continues to fill

queues earlier in the pipeline while waiting for dependencies to clear. It can then proceed again at full speed to drain the

queues.
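
Here is a toy sketch of the decoupling idea, assuming a simple fixed-depth queue between two stages (the depth and stage names are ours): the front end keeps filling while the back end is blocked, and the back end drains the queue once the dependency clears.

```python
from collections import deque

class DecoupledPipe:
    def __init__(self, depth=8):
        self.queue = deque(maxlen=depth)    # buffer between fetch/decode and execute

    def front_end_cycle(self, insn):
        if len(self.queue) < self.queue.maxlen:
            self.queue.append(insn)         # keep fetching even if execute is stalled
            return True
        return False                        # queue full: the front end finally stalls

    def back_end_cycle(self, blocked):
        if blocked or not self.queue:
            return None                     # execute is still waiting on a dependency
        return self.queue.popleft()         # dependency cleared: drain the queue

pipe = DecoupledPipe()
for insn in ["i0", "i1", "i2"]:
    pipe.front_end_cycle(insn)              # fills while the back end is blocked
print(pipe.back_end_cycle(blocked=False))   # i0
```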

This is somewhat analogous to the reservation stations in the out-of-order architectures. As Centaur continues to refine

their architecture, they plan to further decouple the pipeline by adding queues to other stages and execution units.

Super-Pipelining an In-Order Microarchitecture

The 12 stages of the C3 pipeline are shown on the right-hand side of the block diagram in figure 13. By now, you're

probably able to easily identify what happens in each stage. Instructions are fetched from the large I-cache and then pre-

decoded (without needing extra pre-decode bits stored in the cache). The decoder works by first translating x86

instructions into an interim x86 format and placing them into a five-deep buffer, at which point enough is known about

branches to enable static prediction.

From this buffer, the interim instructions are translated into micro-instructions, either directly or from a microcode ROM.

The micro-instructions are queued again before passing through the final decoder where they also receive any data from

registers. From there, the instructions are dispatched to the appropriate execution unit, unless they require access to the

data cache.

Note that this pipeline has the Data Access stages before execution, much different from our computer model. We'll talk

about the implications in a moment. The floating-point units are not designed for the highest performance, since they run

at half the pipeline frequency and are not fully pipelined (a new FP instruction starts every other cycle). After the

execution stage, all instructions proceed through a "Store-Branch" stage before the result registers are updated in the

final pipeline stage. Note that the C3 supports MMX and 3DNow! instructions.

Breaking Our Simple Load/Store Computer Model

During the Store-Branch stage, a couple of interesting things occur. If a branch instruction is incorrectly predicted, the

new target address is sent to the I-cache in this stage. The other operation is to move Store data into a store buffer. Since

an instruction has to pass through this pipeline stage anyway, Centaur was able to directly implement the common Load-

ALU and Load-ALU-Store instructions as single micro-instructions that execute in a single cycle (with data required to be

loaded before the execute stage).

This completely removes the extra Load and Store instructions from the instruction stream (as found in other current x86

processors following internal RISC principles), speeding up execution time for these operations. No other modern x86

processor has this interesting twist to the microarchitecture. It also has the unfortunate side effect of complicating our

original, simple model of a computer pipeline, since this is a register-memory operation.

A Sophisticated Branch Prediction Mechanism

Since the C3 pipeline is fairly deep (P4's pipeline has changed our perspective), good branch prediction becomes quite

important. (That person in the back of the room is going to love this discussion, since Centaur uses every trick and

invents some more.) Centaur takes the interesting approach of directly calculating the target for unconditional branches

that use a displacement value (to an offset address). The designers decided that including a special adder early in the

pipeline was better than relying on a Branch Target Buffer for these instructions (about 95% of all branches). Obviously,

directly calculating the address will always give the correct target address, whereas the BTB may not always contain the

target address.

For conditional branches, Centaur used the G-Share algorithm we described earlier. This uses a 13-bit Global Branch

History that is XOR'd with the branch instruction address (an exclusive-OR of each pair of bits returns a 1 if ONLY one

input bit is a 1). The result indexes into the Branch History Buffer to look up the prediction of the branch. Centaur also

uses the "agrees-mode" enhancement to encode a (single) bit that indicates whether the table look-up agrees with the

static predictor. They also have another 4K-entry table that selects which predictor (simple or history-based) to use for a

particular branch (based on the previous behavior of the branch). Basically, Centaur uses a static predictor and two

different dynamic predictors, as well as a predictor to select which type of dynamic predictor to use. To that person in the

back of the room, if you'd like to know more, check out Centaur's patent filings. A future ExtremeTech article will focus

specifically on branch prediction methods.
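
In the meantime, here is a minimal G-share sketch matching the description above: a 13-bit global history XOR'd with branch-address bits indexes a table of 2-bit counters. The agrees-mode bit and the 4K-entry chooser table are left out for brevity, and the exact address bits Centaur uses are not public.

```python
HISTORY_BITS = 13
TABLE_SIZE = 1 << HISTORY_BITS              # 8192 two-bit counters

class GShare:
    def __init__(self):
        self.history = 0                     # 13-bit global branch history
        self.counters = [2] * TABLE_SIZE     # 2-bit saturating counters, weakly taken

    def _index(self, branch_addr):
        return (branch_addr ^ self.history) & (TABLE_SIZE - 1)

    def predict(self, branch_addr):
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        i = self._index(branch_addr)
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & (TABLE_SIZE - 1)

gs = GShare()
print(gs.predict(0x8048123))                 # initial prediction: taken
gs.update(0x8048123, taken=False)            # learn from the actual outcome
```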

Overall conclusions about the Centaur architecture

This microarchitecture has some interesting innovations that are made possible by staying with an in-order pipeline and

focusing on low-cost, single-processor systems. While these microarchitectural features are interesting, our analysis

doesn't draw any conclusions about performance (except to note the half-speed FP unit). The performance will depend on

the type of applications, and a CPU that is optimized for cost should really be viewed at the system level. If cost is a

primary concern, then the entire system needs to be configured with the minimum hardware required to acceptably run

the applications you care about. Stay tuned to ExtremeTech for benchmarks of these budget PCs.

Overall Conclusions

This ends our journey through the strange world inside modern CPUs. We started from basic concepts and moved very rapidly

through a lot of complicated stuff. We hope you didn't have too much trouble digesting it all at one sitting. As we stated at

the very beginning, the details about microarchitecture are only interesting to CPU architects and hard-core PC

technology enthusiasts. As you've learned, the designers have made several trade-offs, and they've been forced to

optimize for certain types of applications. If those applications are important to you, then check out the appropriate

benchmarks running on real systems. In that way, the CPU microarchitecture can be analyzed in the context of the entire

PC system.

The Future of PC Microarchitectures

It used to be easy to forecast the sort of microarchitectural features coming to PC processors. All one had to do was look

at high-end RISC chips or large computer systems. Well, most of the high-end design techniques have already made their

way into the PC processor world, and to go forward will require new innovation by the PC CPU vendors.

Teaching an Old Dog New Tricks

One interesting trend is to return to older approaches that were not previously viable for the mainstream. The most

noteworthy example is "Very Long Instruction Word (VLIW)" architectures. This is what is referred to as an "exposed

pipeline" where the compiler must specifically encode separate instructions for each parallel operation in advance of

execution. This is much different than forcing the processor to dynamically schedule instructions while it is running.

The key enabler is that compiler technology has improved dramatically, and a VLIW architecture makes the compiler do

more of the work for discovering and exploiting instruction-level-parallelism. Transmeta has implemented an internal

VLIW architecture for their low-power Crusoe CPUs, counting on their Code Morphing software to exploit the parallel

architecture. Intel's new 64-bit "Itanium" architecture uses a version of VLIW, but it has been slow to get to market. It will

be several years before enough interesting desktop applications can be ported to Itanium and make it a mainstream

desktop CPU.

AMD Plans to Hammer Its Way into the High End of the Market

Instead of counting on new compilers and the willingness of software developers to support a radically-new architecture

(like Itanium), AMD is evolving the x86 instruction set to support full 64-bit processing. With a 64-bit architecture, the

"Hammer" series of processors will be better at working on very large problems that require more addressing space

(servers and workstations). There will also be a performance gain for some applications, but the real focus will be support

for large, multi-processor systems. Eventually, the Hammer family could make its way down into the mainstream desktop.

Still Some Features to Copy From RISC

Some new RISC chips have an interesting and exciting feature that hasn't yet made its way into the PC space. Called

"Simultaneous Multithreading (SMT)", this approach duplicates all the registers and swaps register sets whenever a

"thread" comes to a long-latency operation. A thread is just an independent instruction sequence, whether explicitly

defined in a single program or part of a completely different process. This is how multi-processing works with advanced

operating systems, dispatching threads to different processors. Consider that on future CPUs, a main memory load may

take thousands of pipeline cycles.

In an SMT machine, rather than have a processor sit idle while waiting for data from memory, it could just "context switch"

to a different register set and run code from the different thread. The more sets of registers, the more simultaneous

threads the CPU could switch between. It is rumored that Intel's new Xeon processor based on the P4 core actually has

SMT capability built-in but not yet enabled.
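
A toy sketch of the idea, following the article's description rather than any shipping design: the "context switch" is nothing more than pointing at a different duplicated register set.

```python
class MultithreadedCore:
    def __init__(self, num_threads=2):
        self.register_sets = [dict() for _ in range(num_threads)]  # duplicated registers
        self.active = 0                                            # currently running thread

    def on_long_latency_stall(self):
        # Instead of idling while a load misses to main memory, run another thread.
        self.active = (self.active + 1) % len(self.register_sets)
        return self.active

core = MultithreadedCore()
print(core.on_long_latency_stall())   # switches from thread 0 to thread 1
```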

Integration and a Change in Focus

Most of the recent architectural innovation has been directed at performing better on media-oriented tasks. Instead of just

adding instructions for media processing, why not create a media processor that can also handle x86 instructions? A

media processor is a class of CPU that is optimized for processing multiple streams of timing-critical media data.

The shift in focus from "standard" x86 processing will become even more likely as CPUs are more tightly-integrated with

graphics, video, sound and communications subsystems. It's unlikely that vendors would market their products as x86-

compatible media processors, rather than just advanced x86 processors, but the shift in design focus is already

underway.

Getting Comfortable with Complexity

In all too short a time, even these forthcoming technologies will seem like simple designs. We'll soon find it humorous that

we thought a GHz processor was a fast chip. We'll eventually consider it quaint that most computers used only a single

processor, since we could be working on machines with hundreds of CPUs on a chip. Someday we might be forced to

pore through complicated descriptions of the physics of optical processing. We can easily imagine down the road that

some people will long for the simple days when our computers could send data with metal traces on the chips or circuit

boards.

In closing, if you've made it all the way through this article, you agree with that enthusiastic person in the back of the

room. As PC technology enthusiasts, our hobby will just get better and better. These complex new technologies will open

up yet more worlds for our discovery, and we'll be inspired to explore every new detail.

List of References

References and Suggestions for Further Reading:

1. Computer Architecture, a Quantitative Approach, 2nd Edition. Morgan Kaufmann Publishers. Written by Hennessy

& Patterson. This is a great book and is a collaboration between John Hennessy (the Stanford professor who

helped create the MIPS architecture) and Dave Patterson (the Berkeley professor who helped create the SPARC

architecture).

2. Pentium Pro and Pentium II System Architecture, 2nd Edition. Mindshare, Inc. Written by Tom Shanley. This book

is slightly out of date, but Tom does a great job of exposing extra details that aren't part of Intel's official

documentation.

3. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, First Quarter 2001.

http://developer.intel.com/technology/itj/q12001/articles/art_2.htm

Written by Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, and Patrice

Roussel of Intel Corporation. This is a surprisingly detailed look at the Pentium 4 microarchitecture and design

trade-offs.

4. Other Intel links:

o ftp://download.intel.com/pentium4/download/netburstdetail.pdf

o ftp://download.intel.com/pentium4/download/nextgen.pdf

o ftp://download.intel.com/pentium4/download/netburst.pdf

5. AMD Athlon Processor x86 Code Optimization.

http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf Appendix A of this document has an excellent

walk-through of the Athlon microarchitecture.

6. Other AMD links:

o http://www.amd.com/products/cpg/athlon/techdocs/index.html

7. Other Centaur Links:

o http://www.viatech.com

o http://www.centtech.com