Pipelining Computer Architecture


Chapter 3

Instruction-Level Parallelism and Its Exploitation

Computer Architecture

A Quantitative Approach, Fifth Edition

#

#

The University of Adelaide, School of Computer Science

31 March 2015

Chapter 2 Instructions: Language of the Computer


Copyright 2012, Elsevier Inc. All rights reserved.

Introduction

Pipelining became a universal technique in 1985

Overlaps execution of instructions

Exploits Instruction Level Parallelism

Beyond this, there are two main approaches:

Hardware-based dynamic approaches

Used in server and desktop processors

Not used as extensively in PMP processors

Compiler-based static approaches

Not as successful outside of scientific applications


Basic and Intermediate Concepts

What is Pipelining?

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.

Pipe stage (pipe segment)

Throughput

Machine cycle: The time required to move an instruction one step down the pipeline. This time is set by the slowest pipe stage.

In a computer, the machine cycle is usually one clock cycle.

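The pipeline designer's goal is to balance the length of each pipe stage.

If the stages are perfectly balanced, the time per instruction on the pipelined machine equals the time per instruction on the unpipelined machine divided by the number of pipe stages.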

#

A Simple Implementation of A RISC ISA

Five-cycle implementation

Instruction fetch cycle (IF)

Instruction decode/register fetch cycle (ID)

Operand fetches;

Sign-extending the immediate field;

Decoding is done in parallel with reading registers. This technique is known as fixed-field decoding;

Test the branch condition and compute the branch target address; the branch is completed by the end of this cycle.

Execution/effective address cycle (EX)

Memory reference;

Register-Register ALU instruction;

Register-Immediate ALU instruction;

Memory access/branch completion cycle (MEM)

Write-back cycle (WB)

Register-Register ALU instruction;

Register-Immediate ALU instruction;

Load instruction;

#

Performance of the Five-Cycle Implementation

CPI=4.54

Branch instructions (12%) take 2 cycles

Store instructions (10%) require 4 cycles

All other instructions (78%) take 5 cycles
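The CPI figure on this slide is the weighted average of the cycle counts over the instruction mix. A minimal sketch of that arithmetic follows; the names `INSTRUCTION_MIX` and `average_cpi` are illustrative, and the 78% share for the remaining instructions is inferred from the two percentages given.

```python
# Instruction mix from the slide: (fraction of instructions, cycles each).
INSTRUCTION_MIX = {
    "branch": (0.12, 2),
    "store": (0.10, 4),
    "other": (0.78, 5),  # inferred: the remaining 78% of instructions
}


def average_cpi(mix):
    """Weighted average of cycles per instruction over the mix."""
    total_fraction = sum(fraction for fraction, _ in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cycles for fraction, cycles in mix.values())


print(f"CPI = {average_cpi(INSTRUCTION_MIX):.2f}")  # prints "CPI = 4.54"
```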

#

The Classic Five-Stage Pipeline

#

The RISC Pipeline with Registers

#

Instruction Issue

The process of letting an instruction move from the instruction decode stage (ID) into the execution stage (EX) of this pipeline.

#

Basic Performance Issues in Pipelining

Pipelining increases instruction throughput, but it does not reduce the execution time of an individual instruction; pipeline overhead usually lengthens it slightly. Sources of this overhead:

Register delay

Clock skew

Pipeline depth is limited by

Pipeline latency

Pipe stage imbalance

Pipeline overhead

#

Performance of Pipeline with Stalls

When pipelining is thought of as decreasing the CPI,

#

When pipelining is thought of as improving the clock cycle time,

#
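The equations for the two viewpoints above did not survive extraction. As a hedged stand-in, the sketch below uses the standard textbook relation, speedup = (CPI x clock cycle) unpipelined / (CPI x clock cycle) pipelined, with an ideal pipelined CPI of 1 plus stall cycles per instruction. The function names, the depth of 5, and the 0.25 stall cycles per instruction are illustrative assumptions, not values from the slides.

```python
def speedup(cpi_unpipelined, cycle_unpipelined, cpi_pipelined, cycle_pipelined):
    """Ratio of average instruction time, unpipelined over pipelined."""
    return (cpi_unpipelined * cycle_unpipelined) / (cpi_pipelined * cycle_pipelined)


def speedup_decreasing_cpi(depth, stalls_per_instruction):
    # View 1: same clock cycle, unpipelined CPI equals the pipeline depth.
    return speedup(depth, 1.0, 1.0 + stalls_per_instruction, 1.0)


def speedup_improving_clock(depth, stalls_per_instruction):
    # View 2: unpipelined CPI is 1, pipelined clock cycle is 1/depth as long.
    return speedup(1.0, 1.0, 1.0 + stalls_per_instruction, 1.0 / depth)


# Both views reduce to depth / (1 + stall cycles per instruction).
print(speedup_decreasing_cpi(5, 0.25))
print(speedup_improving_clock(5, 0.25))
```

With perfectly balanced stages and no overhead, both views give the same number, which is why stall cycles per instruction are the quantity to minimize.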

Dependence

Dependencies are a property of programs

Pipeline organization determines if dependence is detected and if it causes a stall

Data dependence conveys:

Possibility of a hazard

Order in which results must be calculated

Upper bound on exploitable instruction level parallelism

Dependencies that flow through memory locations are difficult to detect


Pipelining Hazards

A hazard is a situation that prevents the next instruction in the instruction stream from executing during its designated clock cycle.

Three classes of hazards

Structural hazards: arise from resource conflicts.

Data hazards: arise when an instruction depends on the result of a previous instruction.

Control hazards: arise from branches and other instructions that change the PC.

A hazard can force the pipeline to stall. When an instruction is stalled,

Instructions issued later than the stalled instruction are also stalled.

Instructions issued earlier than the stalled one must continue.

Note that a cache miss stalls the whole pipeline.

#

Structural Hazards

If some combination of instructions cannot be accommodated because of resource conflicts, the machine is said to have a structural hazard.

It can arise when:

Some functional unit is not fully pipelined.

Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.

For example, a machine may have only one register-file write port, yet under certain conditions the pipeline may need to perform two writes in one clock cycle. This creates a structural hazard.

When a sequence of instructions encounters this hazard, the pipeline stalls one of the instructions until the required unit is available.

Such stalls increase the cycles per instruction (CPI) above its ideal value of 1 for pipelined machines.

#

Structural Hazards

Due to resource conflicts:

Some functional unit is not fully pipelined.

Some resources have not been duplicated enough.

#

Structural Hazards

#

To resolve the conflict, a stall cycle (pipeline bubble) is inserted. The bubble occupies the resources for that instruction slot as it travels through the pipeline. As a result, instruction i+3 completes in clock cycle 9 instead of clock cycle 8.

Structural Hazards

Another way to represent the stall: no instruction is initiated in clock cycle 4.

                   Clock cycle number
Instruction        1    2    3    4      5    6    7    8    9    10
load               IF   ID   EX   MEM    WB
Instruction i+1         IF   ID   EX     MEM  WB
Instruction i+2              IF   ID     EX   MEM  WB
Instruction i+3                   stall  IF   ID   EX   MEM  WB
Instruction i+4                               IF   ID   EX   MEM  WB
Instruction i+5                                    IF   ID   EX   MEM

#

Sometimes these diagrams are drawn with the stall occupying a whole row and instruction i+3 moved to the next row. In either case the effect is the same: instruction i+3 does not begin execution until cycle 5.
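As a rough check of the diagram, the sketch below computes the fetch cycle of each instruction in a five-stage pipeline where IF and MEM share a single memory port. The function name `fetch_cycles` and the boolean encoding of instructions are illustrative assumptions, and the model ignores every other kind of stall.

```python
MEM_OFFSET = 3  # MEM is the 4th stage, so it occurs at fetch cycle + 3


def fetch_cycles(uses_memory):
    """Return the clock cycle in which each instruction is fetched.

    uses_memory[k] is True when instruction k accesses data memory in MEM.
    Assumes one shared memory port, so IF cannot overlap an earlier MEM access.
    """
    fetched = []
    next_free = 1  # earliest cycle in which the next instruction could be fetched
    for _ in uses_memory:
        cycle = next_free
        # Stall while an earlier load/store occupies the memory port.
        while any(is_mem and start + MEM_OFFSET == cycle
                  for is_mem, start in zip(uses_memory, fetched)):
            cycle += 1
        fetched.append(cycle)
        next_free = cycle + 1
    return fetched


# A load followed by five instructions that do not touch data memory.
print(fetch_cycles([True, False, False, False, False, False]))  # [1, 2, 3, 5, 6, 7]
```

The fourth instruction is fetched in cycle 5 rather than cycle 4, matching the stall shown for instruction i+3.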

Structural Hazards

A machine with structural hazard will have higher CPI

Why would a designer allow structural hazards?

To reduce cost

Pipelining all the functional units or duplicating them may be too costly

To reduce latency

Introducing too many pipeline stages may cause latency issues

#

Data Hazards

An operand access (register or memory) depends on the result of an instruction that has not yet finished.

#

Forwarding ALU Results

#

Forwarding Results to Store

#

Bypassing Results of LOAD

#

Data Hazard Classification

Consider two instructions i and j, with i occurring before j. The possible hazards are:

RAW (read after write): j tries to read a source before i writes it.

WAW (write after write): j tries to write an operand before it is written by i. For example,

LW   R1, 0(R2)      IF  ID  EX  MEM1  MEM2  WB

DADD R1, R2, R3         IF  ID  EX    WB

WAR (write after read): j tries to write a destination before it is read by i. For example, if the read is done in the second half of MEM2 and the write is done in the first half of WB:

SW   0(R1), R2      IF  ID  EX  MEM1  MEM2  WB

DADD R2, R3, R4         IF  ID  EX    WB

RAR (read after read): not a hazard.
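The classification above can be sketched as a comparison of the register sets of two instructions. This is a minimal illustration, not a model of any particular pipeline: the `Instr` type and `possible_hazards` function are hypothetical names, and the result lists which hazard classes are possible, since whether a hazard actually occurs depends on the pipeline timing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    reads: frozenset   # registers read by the instruction
    writes: frozenset  # registers written by the instruction


def possible_hazards(i, j):
    """Hazard classes possible when i precedes j in program order."""
    found = []
    if i.writes & j.reads:
        found.append("RAW")
    if i.writes & j.writes:
        found.append("WAW")
    if i.reads & j.writes:
        found.append("WAR")
    return found  # shared reads only (RAR) are never a hazard


lw = Instr("LW R1, 0(R2)", frozenset({"R2"}), frozenset({"R1"}))
dadd1 = Instr("DADD R1, R2, R3", frozenset({"R2", "R3"}), frozenset({"R1"}))
sw = Instr("SW 0(R1), R2", frozenset({"R1", "R2"}), frozenset())
dadd2 = Instr("DADD R2, R3, R4", frozenset({"R3", "R4"}), frozenset({"R2"}))

print(possible_hazards(lw, dadd1))  # ['WAW']
print(possible_hazards(sw, dadd2))  # ['WAR']
```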

#

Data Hazards Requiring Stalls

The load instruction can forward its result to the AND and OR instructions, but not to the SUB instruction, since that would mean forwarding the result backward in time.

#


The load interlock causes a stall to be inserted at clock cycle 4, delaying the SUB instruction and those that follow by one cycle.

This delay allows the value to be forwarded successfully in the next clock cycle.

#


Pipeline interlock

A piece of hardware that detects a hazard and stalls the pipeline until the hazard is cleared.

#

Control Hazards

Caused by instructions that change the PC.

Some basics

If a branch changes the PC to its target address, it is a taken branch. If it does not change the PC, it falls through or it is not taken.

Recall that if an instruction i is a taken branch, the PC is normally not changed until the end of ID. A stall cycle is required.

Branch instruction   IF   ID   EX   MEM  WB
Branch successor          IF   IF   ID   EX   MEM  WB
Branch successor+1                  IF   ID   EX   MEM  WB
Branch successor+2                       IF   ID   EX   MEM  WB

#

Branch Penalty

Branch delay: The length of a control hazard.

Branch penalty: Unless it is dealt with, the branch delay turns into a branch penalty.

The deeper the pipeline, the worse the branch penalty.

The number of branch stalls can be reduced by two steps

Find out whether the branch is taken or not taken earlier in the pipeline.

Compute the taken PC (i.e., the address of the branch target) earlier.

Branch behavior in programs

Average frequency of taken branches: 67%

60% of the forward branches are taken.

85% of the backward branches are taken.

#

Reducing Branch Penalties

Static branch prediction methods (Compile-time guess).

Freeze or flush the pipeline

Holding or deleting any instructions after the branch until the branch destination is known.

Predict-not-taken (untaken) (Fig. A.12 in A-23)

Predict-taken

Does it have any advantage in this pipeline? Answer: no, because the target address is not known any earlier than the branch outcome.

Delayed branch:

The execution order with a branch delay of length n is

Branch instruction

Sequential successor 1

Sequential successor 2

Sequential successor n (n=1 for MIPS)

Branch target if taken
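To compare the freeze/flush and predict-not-taken schemes listed above, the added stall cycles can be folded into the CPI. The sketch below is only an estimate under stated assumptions: a 12% branch frequency (borrowed from the instruction mix earlier in the deck), the 67% taken frequency, a one-cycle branch delay, and an ideal CPI of 1. The function name is illustrative.

```python
def cpi_with_branch_stalls(branch_frequency, stall_cycles_per_branch, ideal_cpi=1.0):
    """Ideal CPI plus the average branch stall cycles per instruction."""
    return ideal_cpi + branch_frequency * stall_cycles_per_branch


BRANCH_FREQUENCY = 0.12  # assumed: fraction of instructions that are branches
TAKEN_FRACTION = 0.67    # average frequency of taken branches
BRANCH_DELAY = 1         # stall cycles when the branch delay is not hidden

# Freeze/flush: every branch pays the full delay.
cpi_flush = cpi_with_branch_stalls(BRANCH_FREQUENCY, BRANCH_DELAY)

# Predict-not-taken: only taken branches pay the delay.
cpi_not_taken = cpi_with_branch_stalls(BRANCH_FREQUENCY, TAKEN_FRACTION * BRANCH_DELAY)

print(f"freeze/flush:      CPI = {cpi_flush:.4f}")
print(f"predict-not-taken: CPI = {cpi_not_taken:.4f}")
```

Because most branches are taken, predict-not-taken recovers only about a third of the branch stalls in this estimate.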

#

Scheduling the Branch Delay Slot

#

Branch Delay Slots

Requirements for being effective

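Scheduling from before: always improves performance

Scheduling from target: improves performance when the branch is taken

Scheduling from fall through: improves performance when the branch is not taken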

The limitation on delayed-branch scheduling arises from

The restrictions on the instructions that are scheduled into the delay slots.

The ability to predict at compile time whether a branch is likely to be taken or not.

Using canceling (nullifying) branches to relax these limits

In a canceling branch, the instruction encodes the direction in which the branch was predicted. When the branch behaves as predicted, the instruction in the branch delay slot is executed normally. Otherwise, the instruction in the branch delay slot is turned into a no-op.
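The canceling rule described above amounts to a single comparison. A minimal sketch, with the function name and parameters chosen only for illustration:

```python
def delay_slot_executes(predicted_taken, actually_taken, canceling=True):
    """Decide whether the instruction in the branch delay slot takes effect."""
    if not canceling:
        # Ordinary delayed branch: the slot instruction always executes.
        return True
    # Canceling branch: the slot executes only if the prediction was correct;
    # otherwise it is turned into a no-op.
    return predicted_taken == actually_taken


print(delay_slot_executes(predicted_taken=True, actually_taken=True))   # True
print(delay_slot_executes(predicted_taken=True, actually_taken=False))  # False
```

This is what lets the compiler fill the slot from the target or the fall-through path without first proving that the instruction is harmless on the other path.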

#

How Is Pipelining Implemented?

Unpipelined 5-cycle implementation

#

Simple Pipelining Implementation for MIPS

#

Implementing the Control for MIPS Pipeline

Implementing the control focuses on detecting hazards and generating the control signals for forwarding.

Hazard detection

All the data hazards can be checked and forwarding control signals can be set during the ID phase. If a data hazard exists, the instruction is stalled before it is issued.

Alternatively, hazards and forwarding are checked at the beginning of each clock cycle that uses an operand (EX and MEM for the MIPS pipeline).

Implementing the logic for hazard detection

Hazard detection by comparing the destination and sources of adjacent instructions.

For example, all load interlocks can be detected while the instruction that uses the load result is in the ID stage (
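The load-interlock check can be sketched as a comparison between the destination of a load in EX and the sources of the instruction in ID. This is a simplified illustration: the `PipeInstr` type, the `load_interlock` function, and the use of "LW" as the only load opcode are assumptions made for the example, not the control logic of a real MIPS implementation.

```python
from typing import NamedTuple, Optional, Tuple


class PipeInstr(NamedTuple):
    opcode: str
    dest: Optional[str]       # destination register, or None
    sources: Tuple[str, ...]  # source registers


def load_interlock(in_ex: Optional[PipeInstr], in_id: PipeInstr) -> bool:
    """True if the instruction in ID must stall one cycle behind a load in EX."""
    if in_ex is None or in_ex.opcode != "LW":
        return False  # only a load delivers its result too late to forward
    return in_ex.dest is not None and in_ex.dest in in_id.sources


load = PipeInstr("LW", "R1", ("R2",))
sub = PipeInstr("DSUB", "R4", ("R1", "R5"))
add = PipeInstr("DADD", "R6", ("R7", "R8"))

print(load_interlock(load, sub))  # True: DSUB needs R1 in the next cycle
print(load_interlock(load, add))  # False: no dependence on the load
```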

#

Implementing Forwarding Logic

Forwarding sources: ALU or data memory output.

Forwarding destination: ALU input, data memory input, or zero detection unit (for BRANCH).

The forwarding can be implemented by checking the following conditions

EX/MEM.IR.destination = ID/EX.IR.source?

MEM/WB.IR.destination = ID/EX.IR.source?

MEM/WB.IR.destination = EX/MEM.IR.source?
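The first two comparisons above choose the value for an ALU input. A minimal sketch of that selection follows, assuming that the newer result in EX/MEM takes priority over MEM/WB and that register R0 is never forwarded because it always reads as zero in MIPS. The function name and the string labels are illustrative.

```python
def select_alu_operand(source_reg, ex_mem_dest, mem_wb_dest):
    """Return where the ALU should take the value of source_reg from."""
    if source_reg == "R0":
        return "REGFILE"  # R0 is hardwired to zero, so nothing is forwarded
    if ex_mem_dest is not None and ex_mem_dest == source_reg:
        return "EX/MEM"   # result of the immediately preceding instruction
    if mem_wb_dest is not None and mem_wb_dest == source_reg:
        return "MEM/WB"   # result from two instructions back
    return "REGFILE"      # no hazard: use the value read in ID


print(select_alu_operand("R1", ex_mem_dest="R1", mem_wb_dest="R1"))  # EX/MEM
print(select_alu_operand("R1", ex_mem_dest="R2", mem_wb_dest="R1"))  # MEM/WB
print(select_alu_operand("R1", ex_mem_dest=None, mem_wb_dest=None))  # REGFILE
```

The third comparison (MEM/WB destination against EX/MEM source) applies the same test at the data memory input, so that a store can receive the value it writes.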

#

Forwarding Data to the Two ALU Inputs

#

Dealing with Branches in the Pipeline

#

What Makes Pipelining Hard to Implement

Exceptions (interrupts, faults) make pipelining difficult to implement.

Instruction set complications

#

Types of Exceptions

Types

I/O device request

Invoking an OS service from a user program

Tracing instruction execution

Breakpoint

Integer arithmetic overflow or underflow

FP arithmetic anomaly

Page fault

Misaligned memory access

Memory-protection violation

Using an undefined instruction

Hardware malfunction

Power failure

Exceptions for different architectures

#

Classification of Exceptions

Synchronous versus asynchronous

If the event occurs at the same place every time that the program is executed with the same data and memory allocation, the event is called synchronous.

User requested versus coerced

User maskable versus nonmaskable

Within versus between instruction

Depends on whether the event prevents instruction completion by occurring in the middle of execution or whether it is recognized between instructions.

Resume versus terminate

#

Action Requirements for Different Exception Types

Actions

Resume

Terminate

The most difficult exceptions have two properties:

They occur within instructions (i.e., in the EX or MEM stage).

They must be restartable (must save the PC of the instruction at which to restart).

#

Exception Handling

Stopping and restarting execution

Force a trap instruction on the next IF

Until the trap is taken, turn off all writes for the faulting instruction and for all instructions that follow in the pipeline.

After the exception-handling routine in the operating system receives control, it immediately saves the PC of the faulting instruction.

IF  ID  EX  MEM  WB

IF  ID  EX

If delayed branch is used, we need to save and restore as many PCs as the length of the branch delay plus one.

#

Precise Interrupt

A pipeline has precise interrupts if it can be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted from scratch.

Supporting precise interrupts is a requirement in many systems.

Exceptions in DLX

With pipelining, multiple exceptions may occur in the same clock cycle.

#

Implementations of Precise Exceptions

Principle

The pipeline should be able to handle the exceptions caused by instruction i prior to the exceptions caused by instruction i+1.

Implementation

Hardware posts all exceptions caused by a given instruction in a status vector associated with that instruction.

Once an exception indication is set in the exception status vector, any control signal that may cause a data value to be written is turned off.

When an instruction enters WB, the exception status vector is checked. If any exceptions are posted, they are handled in the order in which they would occur in time on an unpipelined machine.

This will guarantee that all exceptions will be seen on instruction i before any are seen on i+1.
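The status-vector scheme above can be illustrated with a small sketch. Exceptions are posted as they are raised, possibly out of program order, but the one that gets handled belongs to the earliest instruction in program order, because instructions reach WB in that order. The event format, `STAGE_ORDER`, and the function name are assumptions made for the example.

```python
STAGE_ORDER = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3}


def first_exceptions_to_handle(posted):
    """Pick the exceptions that are handled first.

    posted: iterable of (cycle, instr_index, stage, name) events, where
    instr_index is the position of the instruction in program order.
    Returns (instr_index, [names ordered by stage]) or None.
    """
    vectors = {}  # one status vector per instruction
    for _cycle, index, stage, name in posted:
        vectors.setdefault(index, []).append((STAGE_ORDER[stage], name))
    if not vectors:
        return None
    first = min(vectors)  # the earliest instruction reaches WB first
    return first, [name for _, name in sorted(vectors[first])]


events = [
    (2, 1, "IF", "instruction page fault"),  # instruction i+1 faults in cycle 2
    (4, 0, "MEM", "data page fault"),        # instruction i faults in cycle 4
]
print(first_exceptions_to_handle(events))  # (0, ['data page fault'])
```

Although the fault on instruction i+1 occurs earlier in time, the fault on instruction i is handled first.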

#

Instruction Committed

When an instruction is guaranteed to complete, it is called committed.

In the MIPS pipeline, all instructions are committed when they reach the end of the MEM stage, and no instruction updates the state before that stage. Thus precise exceptions are straightforward.

#

Instruction Set Complications

Some machines have instructions that change the state in the middle of instruction execution.

VAX: Autoincrement addressing mode.

VAX or IBM 360: String copy.

Implicitly set condition codes.

These cause difficulties in scheduling any pipeline delays between setting the condition code and the branch.

ADD XXX
