Pipelining: Advanced Computer Architecture


Page 1: Pipeline

Pipelining

Advanced Computer Architecture

Page 2: Pipeline

Pipeline Design

• Instruction Pipeline Design

– Instruction Execution Phases
– Mechanisms for Instruction Pipelining
– Dynamic Instruction Scheduling
– Branch Handling Techniques

• Arithmetic Pipeline Design
– Computer Arithmetic Principles
– Static Arithmetic Pipelines
– Multifunctional Arithmetic Pipelines

Page 3: Pipeline

Typical Instruction Pipeline

• A typical instruction execution includes the following sequence of operations:
– Instruction Fetch (F)
– Decode (D)
– Operand Fetch or Issue (I)
– Execute, possibly over several stages (E)
– Write Back (W)

Source: Kai Hwang

Page 4: Pipeline

Instruction Execution Phases

• Each operation (F, D, I, E, W) may require one or more clock cycles. Ideally, these operations are overlapped across consecutive instructions.

• Example (assumptions):
– load and store instructions take four cycles
– add and multiply instructions take three cycles

Shaded regions indicate idle cycles due to dependencies.

Source: Kai Hwang
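To make the overlap concrete, here is a minimal sketch (Python, not from the slides) that prints a space-time chart for an ideal pipeline, assuming every phase takes exactly one clock cycle and no stalls occur:

```python
# Minimal sketch: space-time chart of an ideal instruction pipeline.
# Assumes every phase takes one clock cycle and there are no stalls,
# so instruction i occupies phase s at cycle i + s + 1.

STAGES = ["F", "D", "I", "E", "W"]

def pipeline_chart(n_instructions):
    k = len(STAGES)
    total_cycles = k + n_instructions - 1          # k + (n-1) cycles overall
    print("cycle: " + " ".join(f"{c:>2}" for c in range(1, total_cycles + 1)))
    for i in range(n_instructions):
        row = ["  "] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:>2}"
        print(f"I{i + 1}:   " + " ".join(row))

pipeline_chart(4)
```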

Page 5: Pipeline

Mechanisms for Instruction Pipelining

• Goal: Achieve maximum parallelism in the pipeline by smoothing the instruction flow and minimizing idle cycles

• Mechanisms:
– Prefetch Buffers
– Multiple Functional Units
– Internal Data Forwarding
– Hazard Avoidance

Page 6: Pipeline

Prefetch Buffers

• Used to match the instruction fetch rate to the pipeline consumption rate

• In a single memory access, a block of consecutive instructions is fetched into a prefetch buffer

• Three types of prefetch buffers:
– Sequential buffers, used to store sequential instructions
– Target buffers, used to store branch target instructions
– Loop buffers, used to store loop instructions

Source: Kai Hwang
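A minimal sketch of the sequential-buffer case (illustrative Python; the block size and the string instruction encoding are assumptions, not from the slides): one memory access fills the buffer with a block of consecutive instructions, which the pipeline then drains one instruction at a time.

```python
from collections import deque

# Minimal sketch of a sequential prefetch buffer modeled as a FIFO.
class PrefetchBuffer:
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.buffer = deque()

    def fill(self, memory, pc):
        """One memory access: fetch a block of consecutive instructions."""
        self.buffer.extend(memory[pc:pc + self.block_size])

    def next_instruction(self):
        """Hand one buffered instruction to the pipeline, if available."""
        return self.buffer.popleft() if self.buffer else None

program = ["load", "add", "mul", "store", "add", "branch"]
pf = PrefetchBuffer()
pf.fill(program, pc=0)            # fetch four instructions in one access
while (inst := pf.next_instruction()) is not None:
    print(inst)
```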

Page 7: Pipeline

Multiple Functional Units

• At times, a specific pipeline stage becomes the bottleneck

• A bottleneck is identified by a large number of check marks in a row of the reservation table

• To resolve dependencies, we use reservation stations

• Each RS is uniquely identified by a tag, which is monitored by the tag unit (Register Tagging)

• Reservation stations help in conflict resolution and also serve as buffers

Source: Kai Hwang

Page 8: Pipeline

Internal Data Forwarding

• Goal: Replace memory access operations with register transfer operations

• Types:
– Store-load forwarding
– Load-load forwarding
– Store-store forwarding

Source: Kai Hwang
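A minimal sketch of store-load forwarding as a peephole rewrite (illustrative Python; the triple-based instruction format is an assumption): a load from a memory word that the immediately preceding instruction stored is replaced by a register-to-register move.

```python
# Minimal sketch of internal data forwarding as a peephole rewrite.
# Instructions are (op, dest, src) triples in an illustrative format:
# ("store", "M", "R1") writes R1 to memory word M,
# ("load", "R2", "M") reads memory word M into R2.

def forward(instructions):
    out = []
    for op, dst, src in instructions:
        prev = out[-1] if out else None
        if op == "load" and prev and prev[0] == "store" and prev[1] == src:
            # Store-load forwarding: the load reads the word the previous
            # store just wrote, so replace the memory access with a move.
            out.append(("move", dst, prev[2]))
        else:
            out.append((op, dst, src))
    return out

prog = [("store", "M", "R1"), ("load", "R2", "M")]
print(forward(prog))   # [('store', 'M', 'R1'), ('move', 'R2', 'R1')]
```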

Page 9: Pipeline

Hazard Avoidance

• Reads and writes of shared variables by different instructions in a pipeline may lead to different results if the instructions are executed out of order

• Types:
– Read after Write (RAW) Hazard
– Write after Write (WAW) Hazard
– Write after Read (WAR) Hazard

Source: Kai Hwang
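The three hazard types fall out of comparing the read and write sets of two instructions; a minimal sketch (illustrative Python, register sets assumed):

```python
# Minimal sketch: classify the hazards between instruction i and a later
# instruction j, given the register sets each one reads and writes.

def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    hazards = []
    if i_writes & j_reads:
        hazards.append("RAW")   # j reads what i writes (true dependence)
    if i_writes & j_writes:
        hazards.append("WAW")   # both write the same register (output dependence)
    if i_reads & j_writes:
        hazards.append("WAR")   # j overwrites what i still reads (anti-dependence)
    return hazards

# i: R3 = R1 + R2   followed by   j: R4 = R3 * R5  ->  RAW on R3
print(classify_hazards({"R1", "R2"}, {"R3"}, {"R3", "R5"}, {"R4"}))
```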

Page 10: Pipeline

Instruction Scheduling

• Aim: To schedule instructions through an instruction pipeline

• Types of instruction scheduling:
– Static Scheduling
• Supported by an optimizing compiler
– Dynamic Scheduling
• Achieved by Tomasulo's register-tagging scheme
• Achieved by the scoreboarding scheme

Page 11: Pipeline

Static Scheduling

• Data dependencies in a sequence of instructions create interlocked relationships

• Interlocking can be resolved by the compiler by increasing the separation between interlocked instructions

• Example:

Two independent load instructions can be moved ahead so that the spacing between them and the multiply instruction is increased.
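Hwang's exact instruction sequence is not reproduced here; the following illustrative sequence (Python lists of pseudo-assembly strings) sketches the idea:

```python
# Illustrative pseudo-assembly (not Hwang's exact example): the multiply
# depends on R1 and R2, which the two loads produce. Hoisting the
# independent loads increases their distance from the multiply, so the
# operands are ready by the time the multiply issues.

before = [
    "add  R3, R4, R5",
    "load R1, A",        # multiply must wait on this load ...
    "load R2, B",        # ... and on this one
    "mul  R6, R1, R2",   # interlocked: issues right after the loads
]

after = [
    "load R1, A",        # moved ahead: independent of the add
    "load R2, B",
    "add  R3, R4, R5",   # useful work fills the load latency
    "mul  R6, R1, R2",   # operands ready when this issues
]
```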

Page 12: Pipeline

Tomasulo’s Algorithm

• Hardware-dependent scheme
• Data operands are saved in reservation stations (RS) until dependencies get resolved
• Register tagging is used to allocate/deallocate registers

• All working registers are tagged

Source: Kai Hwang
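A minimal sketch of the register-tagging idea (illustrative Python; the class and names are assumptions, and real Tomasulo hardware also includes reservation-station operand fields and a common data bus):

```python
# Minimal sketch of Tomasulo-style register tagging. A register holds
# either a value or the tag of the reservation station that will produce
# it; when a station completes, its tag and result are broadcast and
# every register waiting on that tag is written back.

class RegisterFile:
    def __init__(self, values):
        self.values = dict(values)   # reg -> value
        self.tags = {}               # reg -> producing RS tag

    def read(self, reg):
        """Return ('tag', t) if the value is still pending, else ('value', v)."""
        if reg in self.tags:
            return ("tag", self.tags[reg])
        return ("value", self.values[reg])

    def rename(self, reg, tag):
        """Issue: register `reg` will be produced by station `tag`."""
        self.tags[reg] = tag

    def broadcast(self, tag, result):
        """Result broadcast: update every register waiting on `tag`."""
        for reg in [r for r, t in self.tags.items() if t == tag]:
            self.values[reg] = result
            del self.tags[reg]

rf = RegisterFile({"R1": 5, "R2": 7})
rf.rename("R3", "RS1")        # issue: mul R3 <- R1 * R2 at station RS1
print(rf.read("R3"))          # ('tag', 'RS1'): consumers wait on the tag
rf.broadcast("RS1", 5 * 7)    # RS1 completes and broadcasts its result
print(rf.read("R3"))          # ('value', 35)
```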

Page 13: Pipeline

Scoreboarding

• Multiple functional units appear in multiple execution pipelines. Parallel units allow instructions to execute out of order w.r.t. the original program sequence.

• The processor has instruction buffers; instructions are issued regardless of the availability of their operands.

• A centralized control unit called the scoreboard is used to keep track of unavailable operands for instructions stored in the buffers

Source: Kai Hwang

Page 14: Pipeline

Branch Handling Techniques

• Pipeline performance is limited by the presence of branch instructions in a program

• Various branch strategies are applied to minimize performance degradation

• To evaluate a branch strategy, two approaches can be followed:
– Trace data approach
– Analytical approach

• Effect of branching … contd.

Page 15: Pipeline

Branching Illustrated

• Ib: Branch Instruction

• Once the branch-taken decision is made, all instructions already in the pipeline after the branch are flushed

• Subsequently, all the instructions at the branch target are run

Source: Kai Hwang

Page 16: Pipeline

Effect of Branching

• Nomenclature:
– Branch Taken: the action of fetching non-sequential (remote) instructions after the branch instruction
– Branch Target: the (remote) instruction to be executed after branch taken
– Delay Slot (b): the number of pipeline cycles consumed between branch taken and branch target

• In general, 0 <= b <= k-1, where k is the number of pipeline stages

Page 17: Pipeline

Effect of Branching

• When a branch is taken, all instructions after the branch instruction become useless; the pipeline is flushed, losing a number of cycles

• Let Ib be the branch instruction; branch taken then causes all instructions from Ib+1 through Ib+k-1 to be drained from the pipeline

• Let p be the probability that an instruction is a branch instruction and q the probability that a branch is taken; the penalty, in terms of time, is then Tpenalty = pqnbt, where

n: number of instructions; b: number of pipeline cycles consumed per taken branch; t: cycle time

• Effective execution time becomes Teff = kt + (n-1)t + pqnbt
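A worked example of these formulas with illustrative numbers (p, q, k, n, and t are assumptions chosen for the calculation, not values from the slides):

```python
# Worked example of the branching penalty and effective execution time.
k = 5            # pipeline stages
n = 1000         # instructions
t = 1e-9         # cycle time: 1 ns
p, q = 0.2, 0.6  # branch frequency and branch-taken probability
b = k - 1        # delay slot: cycles lost per taken branch

T_ideal = k * t + (n - 1) * t     # kt + (n-1)t = 1004 ns
T_penalty = p * q * n * b * t     # pqnbt       = 480 ns
T_eff = T_ideal + T_penalty       #             = 1484 ns

print(f"ideal = {T_ideal*1e9:.0f} ns, penalty = {T_penalty*1e9:.0f} ns, "
      f"effective = {T_eff*1e9:.0f} ns")
```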

Page 18: Pipeline

Branch Prediction

• A branch can be predicted based on:
– Static Branch Strategy
• The probability of branch taken for a particular branch type can be used to predict the branch
• This probability may be obtained by collecting the frequencies of branch taken and of branch types across a large number of program traces
– Dynamic Branch Strategy
• Uses limited recent branch history to predict whether or not the branch will be taken when it occurs next time

Page 19: Pipeline

Branch Prediction Internals

• Branch prediction buffer
– Used to store branch history information in order to make the branch prediction

• State transition diagram used in dynamic branch prediction

Source: Kai Hwang
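The state transition diagram drawn for this slide is commonly the 2-bit saturating counter; assuming that scheme, a minimal sketch:

```python
# Minimal sketch of dynamic branch prediction with a 2-bit saturating
# counter. States 0-1 predict not-taken, states 2-3 predict taken; each
# actual outcome moves the counter one step toward that outcome, so a
# single anomalous branch does not flip a strongly-biased prediction.

class TwoBitPredictor:
    def __init__(self, state=2):      # start at "weakly taken" (a convention)
        self.state = state

    def predict(self):
        return self.state >= 2        # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

bp = TwoBitPredictor()
for actual in [True, True, False, True]:     # e.g. a loop branch
    print(f"predicted taken={bp.predict()}, actual taken={actual}")
    bp.update(actual)
```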

Page 20: Pipeline

Delayed Branches

• Branch penalty can be reduced by the concept of delayed branch

• The central idea is to delay the execution of the branch instruction to accommodate independent* instructions
– Delaying by d cycles allows a few useful (independent*) instructions to be executed before the branch outcome takes effect

* Execution of these instructions should be independent of the outcome of the branch instruction

Page 21: Pipeline

Linear Pipeline Processors

• A linear pipeline processor is constructed with k processing stages i.e. S1 … Sk

• These stages are linearly connected to perform a specific function

• The data stream flows from one end of the pipeline to the other: external inputs are fed into S1, final results move out from Sk, and intermediate results pass from Si to Si+1

• Linear pipelining is applied to:
– Instruction execution
– Arithmetic computation
– Memory access operations

Page 22: Pipeline

Asynchronous Model

• Data flow between adjacent stages is controlled by a handshaking protocol:
– When a stage Si is ready to transmit, it sends a ready signal to stage Si+1
– This is followed by the actual data transfer
– After stage Si+1 receives the data, it returns an acknowledge signal to Si

Source: Kai Hwang
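A minimal sketch of the handshake sequence (illustrative Python; real asynchronous pipelines implement this in logic, not in function calls):

```python
# Minimal sketch of the ready/data/ack handshake between stages Si and Si+1.

class Stage:
    def __init__(self, name):
        self.name = name
        self.data = None

    def receive(self, data):
        self.data = data
        print(f"{self.name}: received {data!r}, returning ack")
        return "ack"

def transmit(sender, receiver, data):
    print(f"{sender}: ready")               # 1. ready signal to next stage
    ack = receiver.receive(data)            # 2. actual data transfer
    print(f"{sender}: received {ack}")      # 3. acknowledge comes back

transmit("S1", Stage("S2"), data=42)
```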

Page 23: Pipeline

Contd…

• Asynchronous pipelines are useful in designing communication channels in message passing multicomputers

• They have a variable throughput rate.
• Different amounts of delay may be experienced in different stages.

Page 24: Pipeline

Synchronous Model

• Clocked latches are used to interface between stages
– Latches are master-slave flip-flops that isolate inputs from outputs. Upon arrival of a clock pulse, all latches transfer data to the next stage at the same time.

• Pipeline stages are combinational circuits.

Source: Kai Hwang

Page 25: Pipeline

Contd…

• It is desirable to have equal delays in all stages.

• These delays determine the clock period and thus the speed of the pipeline.

Page 26: Pipeline

Reservation Table

• It specifies the utilization pattern of successive stages in a synchronous pipeline

• It is a space-time graph depicting the precedence relationships in using the pipeline stages

• For a k-stage linear pipeline, k clock cycles are needed for data to flow through the pipeline
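For illustration (not from the slides): the reservation table of a four-stage linear pipeline is a pure diagonal, since each datum uses stage Si exactly in cycle i, and k = 4 clock cycles are needed to flow through.

```
cycle:  1  2  3  4
S1      X
S2         X
S3            X
S4               X
```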

Page 27: Pipeline

Clocking and Timing Control

• Clock cycle and throughput:
– The clock cycle time (t) of a pipeline is given by

t = tm + d

where tm denotes the maximum stage delay and d denotes the latch delay

– The pipeline frequency (f = 1/t) is referred to as the throughput of the pipeline

• Clock skewing:
– Ideally, clock pulses should arrive at all stages at the same time; due to clock skewing, the same clock pulse may arrive at different stages with an offset of s
– Further, let tmax be the time delay of the longest logic path in a stage and tmin that of the shortest logic path in a stage; then

d + tmax + s <= t <= tm + tmin - s
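A worked example with illustrative numbers (all delays are assumptions chosen to satisfy the bound above):

```python
# Worked example of the clocking relations: t = tm + d, f = 1/t, and
# the skew bound d + tmax + s <= t <= tm + tmin - s.
tm, d, s = 10.0, 1.0, 0.5      # max stage delay, latch delay, skew (ns)
tmax, tmin = 9.0, 4.0          # longest/shortest logic paths in a stage (ns)

t = tm + d                     # clock cycle time = 11 ns
f = 1.0 / t                    # pipeline frequency (throughput)

lower = d + tmax + s           # 10.5 ns
upper = tm + tmin - s          # 13.5 ns
print(f"t = {t} ns, f = {f:.4f} per ns")
print(f"bound satisfied: {lower} <= {t} <= {upper} -> {lower <= t <= upper}")
```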

Page 28: Pipeline

Speedup

• Case 1: Pipelined processor
– Ideally, the number of clock cycles required by a k-stage pipeline to process n tasks is Np = k + (n-1)
(k clock cycles for the first task and 1 clock cycle for each of the remaining n-1 tasks)
– The total time required is Tk = (k + (n-1)) t

• Case 2: Non-pipelined processor
– A non-pipelined processor would take time T1 = nkt

• Speedup Factor: the speedup Sk of a k-stage pipeline over an equivalent non-pipelined processor is:

Sk = T1 / Tk = nkt / ((k + (n-1)) t) = nk / (k + n - 1)
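A short worked example of the speedup factor (the values of k and n are illustrative):

```python
# Speedup of a k-stage pipeline over a non-pipelined processor:
# Sk = nk / (k + n - 1), which approaches k as n grows large.
def speedup(k, n):
    return n * k / (k + n - 1)

for n in (10, 100, 1000):
    print(f"k=5, n={n:>4}: Sk = {speedup(5, n):.2f}")
# k=5, n=  10: Sk = 3.57
# k=5, n= 100: Sk = 4.81
# k=5, n=1000: Sk = 4.98
```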

Page 29: Pipeline

Optimal number of stages:

• Most pipelining is staged at the functional level with 2 ≤ k ≤ 15.

• Very few pipelines are designed to exceed 10 stages in real computers.

• The optimal choice of the number of pipeline stages should maximize the performance/cost ratio for the target processing load.

• Performance/cost ratio: PCR = f / (c + kh) = 1 / ((t/k + d)(c + kh)), where f = 1/(t/k + d)

• The peak of the PCR curve corresponds to the optimal choice for the number of pipeline stages:

k0 = sqrt((t · c) / (d · h))

where t is the total flow-through delay of the pipeline, c is the total stage cost, d is the latch delay, and h is the latch cost.
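A worked example of the optimal stage count (the cost and delay figures are illustrative assumptions):

```python
import math

# Optimal number of pipeline stages: k0 = sqrt(t*c / (d*h)).
t, c = 96.0, 12.0    # total flow-through delay, total stage cost
d, h = 2.0, 9.0      # latch delay, latch cost

k0 = math.sqrt((t * c) / (d * h))
print(f"k0 = {k0:.1f} stages")   # sqrt(1152 / 18) = 8.0
```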

Page 30: Pipeline

Efficiency & Throughput

• Efficiency: It is defined as the speedup factor divided by the number of stages:

Ek = Sk / k = n / (k + (n-1))

• Pipeline Throughput: It is defined as the number of tasks completed per unit time:

Hk = n / ((k + (n-1)) t) = nf / (k + (n-1))
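A worked example combining the three measures (the values of k, n, and t are illustrative):

```python
# Efficiency and throughput of a k-stage pipeline over n tasks.
k, n, t = 5, 100, 1e-9          # stages, tasks, cycle time (1 ns)

Sk = n * k / (k + n - 1)        # speedup    = 4.81
Ek = Sk / k                     # efficiency = 0.96
Hk = n / ((k + n - 1) * t)      # throughput = 9.6e8 tasks/s

print(f"Sk = {Sk:.2f}, Ek = {Ek:.2%}, Hk = {Hk:.3e} tasks/s")
```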