umar72
pipelining as used in computer architecture
Chapter 3
Instruction-Level Parallelism and Its Exploitation
Computer Architecture
A Quantitative Approach, Fifth Edition
#
#
The University of Adelaide, School of Computer Science
31 March 2015
Chapter 2 Instructions: Language of the Computer
1
Copyright 2012, Elsevier Inc. All rights reserved.
Introduction
Pipelining became a universal technique in 1985
Overlaps execution of instructions
Exploits Instruction Level Parallelism
Beyond this, there are two main approaches:
Hardware-based dynamic approaches
Used in server and desktop processors
Not used as extensively in PMP processors
Compiler-based static approaches
Not as successful outside of scientific applications
#
Basic and Intermediate Concepts
What is Pipelining?
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
Pipe stage (pipe segment)
Throughput
Machine cycle: The time required between moving an instruction one step down the pipeline. This time is equal to the time required for the slowest pipe stage.
In a computer, the machine cycle is usually one clock cycle.
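The pipeline designer's goal is to balance the length of each pipe stage.
If the stages are perfectly balanced, the time per instruction on the pipelined machine equals the time per instruction on the unpipelined machine divided by the number of pipe stages.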
#
A Simple Implementation of a RISC ISA
Five-cycle implementation
Instruction fetch cycle (IF)
Instruction decode/register fetch cycle (ID)
Operand fetches;
Sign-extending the immediate field;
Decoding is done in parallel with reading registers. This technique is known as fixed-field decoding;
Test the branch condition and compute the branch address; the branch is completed at the end of this cycle.
Execution/effective address cycle (EX)
Memory reference;
Register-Register ALU instruction;
Register-Immediate ALU instruction;
Memory access/branch completion cycle (MEM)
Write-back cycle (WB)
Register-Register ALU instruction;
Register-Immediate ALU instruction;
Load instruction;
#
Performance of the Five-Cycle Implementation
CPI=4.54
Branch instructions (12%) take 2 cycles
Store instructions (10%) require 4 cycles
All other instructions take 5 cycles
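The CPI figure on this slide is a frequency-weighted average of the cycle counts. A minimal sketch of that arithmetic follows; the 78% share for the remaining instructions is inferred from the two percentages given, and the function name is illustrative.

```python
def average_cpi(mix):
    """Return the frequency-weighted CPI for a list of (frequency, cycles) pairs."""
    total_freq = sum(freq for freq, _ in mix)
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix)


# Instruction mix from the slide; the 78% share is inferred (100% - 12% - 10%).
mix = [
    (0.12, 2),  # branches
    (0.10, 4),  # stores
    (0.78, 5),  # all other instructions
]
cpi = average_cpi(mix)
print(round(cpi, 2))  # 0.24 + 0.40 + 3.90
```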
#
The Classic Five-Stage Pipeline
#
The RISC Pipeline with Registers
#
Instruction Issue
The process of letting an instruction move from the instruction decode stage (ID) into the execution stage (EX) of this pipeline.
#
Basic Performance Issues in Pipelining
Pipelining increases instruction throughput, but it does not reduce the execution time of an individual instruction; pipeline overhead usually lengthens it slightly. Sources of overhead:
Register delay
Clock skew
Pipeline depth is limited by:
Pipeline latency
Pipe stage imbalance
Pipeline overhead
#
Performance of Pipeline with Stalls
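When pipelining is thought of as decreasing the CPI (same clock cycle, ideal pipelined CPI of 1):
Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
If every unpipelined instruction takes as many cycles as there are stages, CPI unpipelined equals the pipeline depth, so Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)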
#
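When pipelining is thought of as improving the clock cycle time (CPI of 1 for both the unpipelined machine and the ideal pipeline):
Speedup = [1 / (1 + Pipeline stall cycles per instruction)] × (Clock cycle unpipelined / Clock cycle pipelined)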
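The two views of pipeline speedup can be compared numerically. The sketch below uses the standard textbook relations and assumes an ideal pipelined CPI of 1; the function names and the sample numbers (depth 5, 0.25 stall cycles per instruction, 5 ns versus 1 ns clock) are illustrative, not taken from the slides.

```python
def speedup_cpi_view(pipeline_depth, stall_cycles_per_instr):
    """Speedup when pipelining is viewed as lowering the CPI.

    Assumes the same clock cycle for both machines and an unpipelined CPI
    equal to the pipeline depth.
    """
    return pipeline_depth / (1.0 + stall_cycles_per_instr)


def speedup_clock_view(clock_unpipelined, clock_pipelined, stall_cycles_per_instr):
    """Speedup when pipelining is viewed as shortening the clock cycle.

    Assumes a CPI of 1 for the unpipelined machine and for the ideal pipeline.
    """
    cpi_ratio = 1.0 / (1.0 + stall_cycles_per_instr)
    return cpi_ratio * (clock_unpipelined / clock_pipelined)


# Illustrative numbers only.
by_cpi = speedup_cpi_view(pipeline_depth=5, stall_cycles_per_instr=0.25)
by_clock = speedup_clock_view(clock_unpipelined=5.0, clock_pipelined=1.0,
                              stall_cycles_per_instr=0.25)
print(by_cpi, by_clock)
```

With perfectly balanced stages and no overhead, the clock-cycle ratio equals the pipeline depth, so both views give the same speedup.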
#
Dependence
Dependencies are a property of programs
Pipeline organization determines whether a dependence is detected and whether it causes a stall
Data dependence conveys:
Possibility of a hazard
Order in which results must be calculated
Upper bound on exploitable instruction level parallelism
Dependencies that flow through memory locations are difficult to detect
#
Pipelining Hazards
A hazard is a situation that prevents the next instruction in the instruction stream from executing during its designated clock cycle.
Three classes of hazards
Structural hazard: Arises from resource conflicts.
Data hazard: Arises when an instruction depends on the results of a previous instruction.
Control hazard: Arises from branches and other instructions that change the PC.
A hazard can force the pipeline to stall. When an instruction is stalled:
Instructions issued later than the stalled instruction are also stalled.
Instructions issued earlier than the stalled one must continue.
Note that a cache miss stalls the whole pipeline.
#
Structural Hazards
If a certain combination of instructions cannot be accommodated because of resource conflicts, the machine is said to have a structural hazard
It can be generated by:
Some functional unit is not fully pipelined
Some resource has not been duplicated enough to allow all the combinations of instructions in the pipeline to execute
For example: a machine may have only one register file write port, but under certain conditions the pipeline might want to perform two writes in one clock cycle; this generates a structural hazard
When a sequence of instructions encounters this hazard, the pipeline stalls one of the instructions until the required unit is available
Such stalls increase the clock cycles per instruction (CPI) from its ideal value of 1 for pipelined machines
#
Structural Hazards
Due to resource conflicts
Due to some functional unit not being fully pipelined.
When some resources have not been duplicated enough.
#
Structural Hazards
#
To solve the problem, a stall cycle is added. The effect of the pipeline bubble is actually to occupy the resources for that instruction slot as it travels through the pipeline. Performance wise, instruction 3 will not complete during clock cycle 8, but during clock cycle 9.
Structural Hazards
Another way to represent the stall: no instruction is initiated in clock cycle 4
Instruction number 1     2     3     4     5     6     7     8     9     10
load               IF    ID    EX    MEM   WB
Instruction i+1          IF    ID    EX    MEM   WB
Instruction i+2                IF    ID    EX    MEM   WB
Instruction i+3                      stall IF    ID    EX    MEM   WB
Instruction i+4                                  IF    ID    EX    MEM   WB
Instruction i+5                                        IF    ID    EX    MEM
#
Sometimes these diagrams are drawn with the stall occupying a whole row, with instruction 3 moved to the next row. In either case, the effect is the same: instruction 3 does not begin execution until cycle 5.
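The stall in the diagram can be reproduced with a small simulation. This is a sketch under a stated assumption: the conflicting resource is a single memory port shared by instruction fetch (IF) and data access (MEM), which the surviving slide text does not spell out. The function name `fetch_schedule` is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def fetch_schedule(accesses_memory):
    """Return the clock cycle in which each instruction performs IF.

    accesses_memory[k] is True if instruction k uses data memory in MEM.
    Assumption: one memory port is shared by IF and MEM, so a fetch must
    wait while an earlier instruction occupies the port in its MEM stage.
    """
    fetch_cycles = []
    busy_cycles = set()  # cycles in which the port serves a data access
    cycle = 1
    for uses_mem in accesses_memory:
        while cycle in busy_cycles:  # port busy: insert a stall cycle
            cycle += 1
        fetch_cycles.append(cycle)
        if uses_mem:
            busy_cycles.add(cycle + STAGES.index("MEM"))
        cycle += 1
    return fetch_cycles


# A load followed by five instructions that do not touch data memory.
schedule = fetch_schedule([True, False, False, False, False, False])
print(schedule)  # [1, 2, 3, 5, 6, 7]
```

The load occupies the port in cycle 4, so instruction i+3 is fetched in cycle 5 and every later instruction slips by one cycle, matching the diagram.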
Structural Hazards
A machine with structural hazards will have a higher CPI
Why would a designer allow structural hazards?
To reduce cost
Pipelining all the functional units or duplicating them may be too costly
To reduce latency
Introducing too many pipeline stages may cause latency issues
#
Data Hazards
An instruction, such as a memory access, depends on the result of an instruction that has not yet finished.
#
Forwarding ALU Results
#
Forwarding Results to Store
#
Bypassing Results of LOAD
#
Data Hazard Classification
Consider two instructions i and j, with i occurring before j. The possible hazards are:
RAW (read after write) : j tries to read a source before i writes it.
WAW (write after write): j tries to write an operand before it is written by i. For example,
LW R1, 0(R2)      IF    ID    EX    MEM1  MEM2  WB
DADD R1, R2, R3         IF    ID    EX    WB
WAR (write after read): j tries to write a destination before it is read by i. For example, if read is done in the second half of MEM2, and write is done in the first half of WB.
SW 0(R1), R2      IF    ID    EX    MEM1  MEM2  WB
DADD R2, R3, R4         IF    ID    EX    WB
RAR (read after read): not a hazard.
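The classification above reduces to set intersections on the registers each instruction reads and writes. The following sketch names the potential hazards between an earlier instruction i and a later instruction j; whether a hazard actually occurs still depends on the pipeline timing, as the two examples show. The function name and the register sets are illustrative.

```python
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    """Name the potential data hazards between instruction i and a later j.

    Each argument is a set of register names. RAR is not reported because
    it is not a hazard.
    """
    hazards = []
    if i_writes & j_reads:
        hazards.append("RAW")  # j reads what i writes
    if i_writes & j_writes:
        hazards.append("WAW")  # both write the same register
    if i_reads & j_writes:
        hazards.append("WAR")  # j overwrites what i still has to read
    return hazards


# LW R1, 0(R2) followed by DADD R1, R2, R3
waw_example = classify_hazards({"R2"}, {"R1"}, {"R2", "R3"}, {"R1"})
# SW 0(R1), R2 followed by DADD R2, R3, R4 (the store reads R1 and R2)
war_example = classify_hazards({"R1", "R2"}, set(), {"R3", "R4"}, {"R2"})
print(waw_example, war_example)  # ['WAW'] ['WAR']
```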
#
Data Hazards Requiring Stalls
The load instruction can forward its result to the AND and OR instructions, but not to the SUB instruction, since that would mean forwarding the result backward in time
#
Data Hazards Requiring Stalls
The load interlock causes a stall to be inserted at clock cycle 4, delaying the SUB instruction and those that follow by one cycle.
This delay allows the value to be forwarded successfully in the next clock cycle
#
Data Hazards Requiring Stalls
Pipeline interlock
A piece of hardware that detects a hazard and stalls the pipeline until the hazard is cleared.
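A load interlock of this kind can be sketched as a single comparison. The sketch assumes a five-stage pipeline with forwarding, where only a load immediately followed by a consumer needs a stall; the function name, the opcode string, and the register operands are hypothetical.

```python
def needs_load_interlock(ex_opcode, ex_dest, id_sources):
    """Return True if the instruction in ID must stall for one cycle.

    ex_opcode and ex_dest describe the instruction currently in EX;
    id_sources is the set of registers read by the instruction in ID.
    Assumption: results of ALU instructions are forwarded, so only a load
    directly ahead of a consumer forces a stall.
    """
    return ex_opcode == "LW" and ex_dest in id_sources


# A load of R1 directly followed by an instruction that reads R1 and R5.
stall_after_load = needs_load_interlock("LW", "R1", {"R1", "R5"})
# An ALU result is forwarded, so no stall is needed.
stall_after_alu = needs_load_interlock("DADD", "R1", {"R1", "R5"})
print(stall_after_load, stall_after_alu)  # True False
```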
#
Control Hazards
Caused by the instructions that change PC.
Some basics
If a branch changes the PC to its target address, it is a taken branch. If it does not change the PC, it falls through or it is not taken.
Recall that if an instruction i is a taken branch, the PC is normally not changed until the end of ID. A stall cycle is required.
Branch instruction  IF   ID   EX   MEM  WB
Branch successor         IF   IF   ID   EX   MEM  WB
Branch successor+1                 IF   ID   EX   MEM  WB
Branch successor+2                      IF   ID   EX   MEM  WB
#
Branch Penalty
Branch delay: The length of a control hazard.
Branch penalty: The branch delay turns into a branch penalty unless it is dealt with.
The deeper the pipeline, the worse the branch penalty.
The number of branch stalls can be reduced by two steps
Find out whether the branch is taken or not taken earlier in the pipeline.
Compute the taken PC (i.e., the address of the branch target) earlier.
Branch behavior in programs
Average frequency of taken branches: 67%
60% of the forward branches are taken.
85% of the backward branches are taken.
#
Reducing Branch Penalties
Static branch prediction methods (Compile-time guess).
Freeze or flush the pipeline
Holding or deleting any instructions after the branch until the branch destination is known.
Predict-not-taken (untaken) (Fig. A.12 in A-23)
Predict-taken
Does it have any advantage in this pipeline? No: the target address is not known any earlier than the branch outcome.
Delayed branch:
The execution sequence with a branch delay of n is
Branch instruction
Sequential successor 1
Sequential successor 2
Sequential successor n (n=1 for MIPS)
Branch target if taken
#
Scheduling the Branch Delay Slot
#
Branch Delay Slots
Requirements for being effective
Scheduling from before: Always
Scheduling from target: Taken
Scheduling from fall through: Not taken
The limitation on delayed-branch scheduling arises from
The restrictions on the instructions that are scheduled into the delay slots.
The ability to predict at compile time whether a branch is likely to be taken or not.
Using canceling (nullifying) branches to relax these limits
In a canceling branch, the instruction includes the direction that the branch was predicted. When the branch behaves as predicted, the instruction in the branch delay slot is simply executed. Otherwise, the instruction in the branch delay slot is simply turned into a No-Op.
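The rules on this slide can be written as two small predicates: one for when each scheduling strategy fills the delay slot usefully, and one for what a canceling branch does with the slot. This is a sketch; the strategy labels and function names are illustrative.

```python
def delay_slot_is_useful(strategy, branch_taken):
    """Return True if the delay-slot instruction does useful work."""
    if strategy == "from_before":
        return True  # useful whether or not the branch is taken
    if strategy == "from_target":
        return branch_taken
    if strategy == "from_fall_through":
        return not branch_taken
    raise ValueError(f"unknown strategy: {strategy}")


def canceling_slot_action(predicted_taken, branch_taken):
    """Return what a canceling branch does with its delay-slot instruction."""
    return "execute" if predicted_taken == branch_taken else "no-op"


print(delay_slot_is_useful("from_target", branch_taken=False))        # False
print(canceling_slot_action(predicted_taken=True, branch_taken=False))  # no-op
```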
#
How Is Pipelining Implemented?
Unpipelined 5-cycle implementation
#
Simple Pipelining Implementation for MIPS
#
Implementing the Control for MIPS Pipeline
Implementing the control focuses on detecting hazards and generating the control signals for forwarding.
Hazard detection
All the data hazards can be checked and forwarding control signals can be set during the ID phase. If a data hazard exists, the instruction is stalled before it is issued.
Alternatively, hazards and forwarding are checked at the beginning of a clock cycle that uses an operand (EX and MEM for the MIPS pipeline).
Implementing the logic for hazard detection
Hazard detection by comparing the destination and sources of adjacent instructions.
An example shows the detection of all load interlocks when the instruction using the load result is in the ID stage (
#
Implementing Forwarding Logic
Forwarding sources: ALU or data memory output.
Forwarding destination: ALU input, data memory input, or zero detection unit (for BRANCH).
The forwarding can be implemented by checking the following conditions
EX/MEM.IR.destination = ID/EX.IR.source?
MEM/WB.IR.destination = ID/EX.IR.source?
MEM/WB.IR.destination = EX/MEM.IR.source?
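The three comparisons above can be sketched as checks on the pipeline registers. In this sketch the `Latch` class, its field names, and the tuple format of the result are hypothetical; giving the EX/MEM result priority over MEM/WB when both match is an assumed design choice (the more recent result wins), and special cases such as a hardwired zero register are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Latch:
    """Register fields of the instruction held in one pipeline register."""
    dest: Optional[str] = None      # register the instruction writes, if any
    sources: Tuple[str, ...] = ()   # registers the instruction reads


def forwarding_paths(id_ex, ex_mem, mem_wb):
    """Return (from_latch, to_latch, register) for each forwarding match."""
    paths = []
    for src in id_ex.sources:
        if ex_mem.dest is not None and ex_mem.dest == src:
            paths.append(("EX/MEM", "ID/EX", src))   # most recent result
        elif mem_wb.dest is not None and mem_wb.dest == src:
            paths.append(("MEM/WB", "ID/EX", src))
    for src in ex_mem.sources:
        if mem_wb.dest is not None and mem_wb.dest == src:
            paths.append(("MEM/WB", "EX/MEM", src))  # e.g. data for a store
    return paths


# DADD R1, R2, R3 has reached EX/MEM; an instruction reading R1 and R5 is in ID/EX.
alu_case = forwarding_paths(
    id_ex=Latch(dest="R4", sources=("R1", "R5")),
    ex_mem=Latch(dest="R1", sources=("R2", "R3")),
    mem_wb=Latch(),
)
print(alu_case)  # [('EX/MEM', 'ID/EX', 'R1')]
```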
#
Forwarding Data to the Two ALU Inputs
#
Dealing with Branches in the Pipeline
#
What Makes Pipelining Hard to Implement
Exceptions (interrupts, faults) make pipelining difficult to implement.
Instruction set complications
#
Types of Exceptions
Types
I/O device request
Invoking an OS service from a user program
Tracing instruction execution
Breakpoint
Integer arithmetic overflow or underflow
FP arithmetic anomaly
Page fault
Misaligned memory access
Memory-protection violation
Using an undefined instruction
Hardware malfunction
Power failure
Exceptions in different architectures
#
Classification of Exceptions
Synchronous versus asynchronous
If the event occurs at the same place every time that the program is executed with the same data and memory allocation, the event is called synchronous.
User requested versus coerced
User maskable versus nonmaskable
Within versus between instruction
Depends on whether the event prevents instruction completion by occurring in the middle of execution or whether it is recognized between instructions.
Resume versus terminate.
#
Action Requirements for Different Exception Types
Actions
Resume
Terminate
The most difficult exceptions have two properties:
They occur within instructions (i.e. at EX or MEM stages).
They must be restartable (must save the PC of the instruction at which to restart).
#
Exception Handling
Stopping and restarting execution
Force a trap instruction on the next IF
Until the trap is taken, turn off all writes for the faulting instruction and for all instructions that follow in the pipeline.
After the exception-handling routine in the operating system receives control, it immediately saves the PC of the faulting instruction.
IF ID EX MEM WB   IF ID EX
If delayed branch is used, we need to save and restore as many PCs as the length of the branch delay plus one.
#
Precise Interrupt
A pipeline has precise interrupts if it can be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted from scratch.
Supporting precise interrupts is a requirement in many systems.
Exceptions in DLX
With pipelining, multiple exceptions may occur in the same clock cycle.
#
Implementations of Precise Exceptions
Principle
The pipeline should be able to handle the exceptions caused by instruction i prior to the exceptions caused by instruction i+1.
Implementation
Hardware posts all exceptions caused by a given instruction in a status vector associated with that instruction.
Once an exception indication is set in the exception status vector, any control signal that may cause a data value to be written is turned off.
When an instruction enters WB, the exception status vector is checked. If any exceptions are posted, they are handled in the order in which they would occur in time on an unpipelined machine.
This will guarantee that all exceptions will be seen on instruction i before any are seen on i+1.
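The effect of the status vector can be illustrated with a small model. It is a sketch, not the hardware: each instruction carries a dictionary that maps a stage to the exception posted there, instructions reach WB in program order, and the first posted exception found is the one handled. The instruction pair and the exception names are illustrative.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def raise_cycle(issue_cycle, stage):
    """Clock cycle in which an instruction issued at issue_cycle is in stage."""
    return issue_cycle + STAGES.index(stage)


def first_handled(program):
    """Return (name, exception) for the first exception handled, or None.

    program lists (name, status_vector) pairs in program order, where
    status_vector maps a stage name to the exception posted in that stage.
    Vectors are checked as instructions reach WB, so program order decides,
    and within one instruction the earlier stage comes first.
    """
    for name, status_vector in program:
        for stage in STAGES:
            if stage in status_vector:
                return name, status_vector[stage]
    return None


program = [
    ("LW", {"MEM": "data page fault"}),          # issued in cycle 1
    ("DADD", {"IF": "instruction page fault"}),  # issued in cycle 2
]
# The DADD fault is raised first in time (cycle 2, versus cycle 4 for the LW fault),
# yet the LW fault is the one handled first.
handled = first_handled(program)
print(handled)  # ('LW', 'data page fault')
```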
#
Instruction Committed
When an instruction is guaranteed to complete, it is called committed.
In the MIPS pipeline, all instructions are committed when they reach the end of the MEM stage, and no instruction updates the state before that stage. Thus precise exceptions are straightforward.
#
Instruction Set Complications
Some machines have instructions that change the state in the middle of instruction execution.
VAX: Autoincrement addressing mode.
VAX or IBM 360: String copy.
Implicitly set condition code.
Causes difficulties in scheduling any pipeline delays between setting the condition code and the branch.
ADD XXX