
Page 1: Pipelining - Yonsei

DSP VLSI Design

Pipelining

Byungin Moon

Yonsei University

Page 2: Pipelining - Yonsei


Outline

What is pipelining?
Performance advantage of pipelining
Pipeline depth
Interlocking
  Due to resource contention
  Due to data dependency
Branching effects
Interrupt effects
Pipeline programming models
  Time-stationary
  Data-stationary

Page 3: Pipelining - Yonsei


Definition
  A technique for increasing the performance of a processor (or other electronic system) by breaking a sequence of operations into smaller pieces and executing these pieces in parallel when possible
  Used in almost all current DSP processors

Strength
  Decreases the overall time required to complete the set of operations

Weakness
  Complicates programming
    Execution time of a specific instruction sequence can vary from case to case
    Certain instruction sequences must be avoided for correct program operation

Represents a trade-off between efficiency and ease of use

Page 4: Pipelining - Yonsei


Pipelining

Illustration of how pipelining increases performance on a hypothetical processor (a timing sketch follows below)
  The hypothetical processor uses separate execution units to accomplish the following actions for a single instruction (each stage takes 20 ns to execute); it resembles the TI TMS320C3x
    Fetch an instruction word from memory
    Decode the instruction
    Read or write a data operand from or to memory
    Execute the ALU or MAC portion of the instruction
  Nonpipelined
    The four stages are executed sequentially
    Execution time of 80 ns per instruction
    Each stage is idle 75% of the time
  Pipelined
    The four stages of execution are overlapped
    Executes a new instruction every 20 ns
    An instruction appears to the programmer to execute in one instruction cycle
    Instructions appear to execute sequentially
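A minimal timing sketch (my own illustration, not from the slides) of the arithmetic behind this comparison, assuming an ideal four-stage pipeline with no interlocks or branches:

    # Timing arithmetic for the hypothetical 4-stage, 20 ns/stage processor.
    # Assumptions: ideal pipeline, no stalls, no branches.
    STAGE_TIME_NS = 20
    STAGES = 4          # fetch, decode, operand read/write, execute

    def nonpipelined_time(n_instructions):
        # Each instruction occupies all four stages back to back: 80 ns each.
        return n_instructions * STAGES * STAGE_TIME_NS

    def pipelined_time(n_instructions):
        # The first instruction takes the full 80 ns (pipeline fill);
        # every following instruction completes 20 ns after the previous one.
        return (STAGES + n_instructions - 1) * STAGE_TIME_NS

    for n in (1, 4, 100):
        print(n, nonpipelined_time(n), pipelined_time(n))
    # For large n the speedup approaches STAGES (4x): one instruction
    # appears to complete every 20 ns.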

Page 5: Pipelining - Yonsei


Pipelining

Performance Comparison (Nonpipelined vs. Pipelined)

Page 6: Pipelining - Yonsei


Pipeline Depth
  Number of pipeline stages
    Varies from one processor to another
  A deeper pipeline
    Allows the processor to execute faster (the sketch below illustrates this trade-off)
    But makes the processor harder to program
  Most processors use three or four stages
  Three-stage pipeline
    Instruction fetch, decode, and execute
    Operand fetch is typically done in the latter part of the decode stage
  Four-stage pipeline
    Instruction fetch, decode, operand fetch, and execute
  Others
    Analog Devices processors (two stages) and the TI TMS320C54x (five stages)
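As a rough illustration (my own model, not from the slides), the sketch below assumes the 80 ns of logic from the earlier example is split evenly across the stages, plus a hypothetical 1 ns latch overhead per stage; deeper pipelines then raise throughput at the cost of a longer fill time and larger hazard penalties:

    # Rough model of why a deeper pipeline can run faster.
    def pipeline_time_ns(n_instructions, stages, logic_delay_ns=80, latch_overhead_ns=1):
        # Assume the total logic delay is split evenly across the stages,
        # plus a fixed register/latch overhead per stage.
        cycle = logic_delay_ns / stages + latch_overhead_ns
        return (stages + n_instructions - 1) * cycle

    for depth in (2, 3, 4, 5):
        print(depth, round(pipeline_time_ns(1000, depth)), "ns for 1000 instructions")
    # Throughput improves with depth, but so do branch and interlock
    # penalties, which is the programming cost the slides describe.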

Page 7: Pipelining - Yonsei


What is Interlocking?
  Resource contention
    Pipelined processors may not perform as well as shown in the hypothetical example
    Mainly due to resource contention (conflicts)
  Example
    Suppose it takes two instruction cycles to write to memory (as in the AT&T DSP16xx processors)
    Instruction I2 attempts to write to memory and I3 needs to read from memory
    The second cycle of I2's data write conflicts with I3's data read
  Solution to resource contention: interlocking
  Interlocking
    Delays the progression of the latter of the conflicting instructions through the pipeline (sketched below)
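A minimal stall-insertion sketch (my own illustration of the example above, in Python); the instruction names I1-I4 and the two-cycle write come from the example, everything else is assumed:

    # Each instruction states whether it reads or writes data memory; a write
    # holds the memory port for 2 cycles, so a following access must wait.
    def schedule(instructions, write_cycles=2, read_cycles=1):
        mem_free_at = 0          # first cycle in which data memory is free again
        issue_cycle = 0          # cycle in which the next instruction reaches its memory stage
        result = []
        for name, access in instructions:
            start = issue_cycle
            if access in ("read", "write"):
                start = max(start, mem_free_at)            # interlock: wait for the port
                busy = write_cycles if access == "write" else read_cycles
                mem_free_at = start + busy
            result.append((name, start, start > issue_cycle))  # True => stalled
            issue_cycle = start + 1
        return result

    program = [("I1", None), ("I2", "write"), ("I3", "read"), ("I4", None)]
    for name, cycle, stalled in schedule(program):
        print(name, "memory stage at cycle", cycle, "(stalled)" if stalled else "")
    # I3's read is delayed one cycle because I2's write occupies memory for 2 cycles.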

Page 8: Pipelining - Yonsei


Pipelining

Example of Pipeline Resource Contention and Interlocking to Resolve Resource Contention

Page 9: Pipelining - Yonsei


Pipelining

Complicated programming on interlocking pipelined processors
  There are a number of interlocking sources
    For example, in processors supporting instructions with long immediate data:
      An instruction with long immediate data requires an additional program memory fetch to get the immediate data
      This long immediate data fetch conflicts with the fetch of the next instruction, resulting in an interlock
  It is not easy to spot interlocks by reading the program code
    Whether the pipeline interlocks depends on the instructions that surround a given instruction
    For example, if instruction I3 in the previous example did not need to read from data memory, there would be no conflict and no interlock would occur

Page 10: Pipelining - Yonsei


Data Hazard – Another Interlocking Source

Example from the Motorola DSP5600x
  Makes little use of interlocking
  Uses a three-stage pipeline
    Fetch
    Decode – addresses used in data accesses are formed
    Execute – ALU operation, data accesses, register loads
  Example code (R0 contains the hexadecimal value 5678 before execution of the code below):

    MOVE #$1234, R0
    MOVE X:(R0), X0

  Seemingly
    The above instructions move the value stored at X memory address 1234 into register X0
  Actually
    The above instructions move the value stored at X memory address 5678 into register X0
    This is because of a pipeline hazard resulting from data dependency (a short simulation follows below)
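The following Python sketch (my own model of this example, not DSP5600x code) shows why the old value of R0 is used: the second MOVE forms its address in the decode stage before the first MOVE's register load completes in the execute stage. The memory contents are invented placeholders.

    # X data memory with placeholder contents at the two addresses of interest.
    X_MEMORY = {0x1234: "value_at_1234", 0x5678: "value_at_5678"}

    regs = {"R0": 0x5678, "X0": None}      # R0 holds 5678 before the code runs

    # Instruction 1 is in execute while instruction 2 is in decode (same cycle):
    addr_for_move2 = regs["R0"]            # decode of MOVE X:(R0),X0 forms the address NOW
    regs["R0"] = 0x1234                    # execute of MOVE #$1234,R0 completes afterwards

    regs["X0"] = X_MEMORY[addr_for_move2]  # execute of MOVE X:(R0),X0
    print(hex(addr_for_move2), regs["X0"]) # 0x5678 value_at_5678, not the value at 0x1234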

Page 11: Pipelining - Yonsei


A Motorola DSP5600x Pipeline Hazard

Page 12: Pipelining - Yonsei


Data Hazard – Another Interlocking Source

Interlocking to protect the programmer from the hazard

TI TMS320C3x, TMS320C4x, and TMS320C5x processors

The TMS320C3x detects writes to any of its address registers and holds up the progression through the pipeline of other instructions that use any address register until the write has completed (a stall sketch follows below)

Trade-off made by heavily interlocked processors
  Saves the programmer from worrying about whether certain instruction sequences will produce correct output
  Allows the programmer to write slower-than-optimal code without even realizing it
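A sketch (my own illustration of the interlock behavior described above, not TI's implementation) that stalls any instruction using an address register with a pending write; the two-cycle write latency, the register name AR0, and the operand values are assumptions made for the example:

    def issue_cycles(program, write_latency=2):
        # program: list of (name, register_written, register_used); None if not applicable
        ready_at = {}                    # cycle at which each register's pending write completes
        cycle = 0
        timeline = []
        for name, writes, uses in program:
            start = cycle
            if uses is not None:
                start = max(start, ready_at.get(uses, 0))   # interlock: hold until the write is done
            if writes is not None:
                ready_at[writes] = start + write_latency
            timeline.append((name, start, start - cycle))   # third field = stall cycles inserted
            cycle = start + 1
        return timeline

    prog = [("LDI 2000h,AR0", "AR0", None),
            ("MPYF *AR0,R0",  None,  "AR0"),   # must wait for the AR0 write
            ("ADDF R1,R0",    None,  None)]
    for name, start, stalls in issue_cycles(prog):
        print(f"{name:16s} proceeds at cycle {start} ({stalls} stall cycle(s))")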

Page 13: Pipelining - Yonsei


Pipelining

Interlocking to Solve the Pipeline Hazard (from the TI TMS320C3x)

The LDI (load immediate) instruction loads a value into an address register
MPYF (floating-point multiply) uses register-indirect addressing to fetch one of its operands

Page 14: Pipelining - Yonsei


Branching Effects
  Control dependency from branches
    When a branch instruction reaches the decode stage of the pipeline and the processor realizes that it must begin executing at a new address, the next sequential instruction word has already been fetched and is in the pipeline
    Once the processor recognizes a branch instruction, it does not know where the next instruction is located until the branch is resolved
  One solution – the multicycle branch
    Discard, or flush, the unwanted instruction
    Cease fetching new instructions until the branch is resolved
    Results in some wasted cycles (a cost sketch follows below)
  Some processors use tricks to execute the branch late in the decode stage, saving one instruction cycle
  Almost all DSP processors use multicycle branches
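A small cost sketch (my own arithmetic, reusing the hypothetical four-stage pipeline from earlier and assuming the branch is resolved at the end of the decode stage):

    # Multicycle branch cost: instructions fetched after the branch are flushed,
    # and no new fetch happens until the target address is known.
    STAGES = ["fetch", "decode", "memory", "execute"]

    def branch_penalty(resolve_stage="decode"):
        # Cycles between fetching the branch and fetching its target;
        # everything fetched in between is flushed.
        return STAGES.index(resolve_stage)      # 1 wasted cycle if resolved in decode

    def time_with_branches(n_instructions, n_branches, resolve_stage="decode"):
        fill = len(STAGES) - 1                  # pipeline fill time
        return fill + n_instructions + n_branches * branch_penalty(resolve_stage)

    print(branch_penalty())                     # 1 wasted cycle per branch
    print(time_with_branches(100, 10))          # total cycles for 100 instructions, 10 branches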

Page 15: Pipelining - Yonsei


Branch Effects

Alternative to the multicycle branch – the delayed branch
  Several instructions following the branch are executed normally

    BRD NEW_ADDR
    INST 2    ; INST2 to INST4
    INST 3    ; are executed before
    INST 4    ; the branch occurs

  Instructions that will be executed before the branch takes effect must be located in memory after the branch
  The branch appears to be delayed in its effect by several instruction cycles (an execution-order sketch follows below)

Used by the TMS320C3x, TMS320C4x, TMS320C5x, ADSP-2100x, DSP32C, DSP32xx, and ZR3800x

Trade-offs of multicycle and delayed branches
  Ease of programming versus efficiency, as with interlocking
  In the worst case, NOP instructions can always be placed after a delayed branch

Branch effects occur whenever there is a change in program flow
  Subroutine call instructions, subroutine return instructions, and return-from-interrupt instructions
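A sketch (my own illustration) of the execution order a delayed branch produces, assuming three delay slots as in the BRD example above; the instruction labels are placeholders:

    def run(program, delay_slots=3):
        executed, pc = [], 0
        pending_target = None        # target of a delayed branch currently in flight
        countdown = 0
        while pc < len(program):
            op, arg = program[pc]
            executed.append((pc, op, arg))
            if pending_target is not None:
                countdown -= 1
                if countdown == 0:   # delay slots done: the branch finally takes effect
                    pc, pending_target = pending_target, None
                    continue
            if op == "BRD":          # delayed branch: effect deferred by the delay slots
                pending_target, countdown = arg, delay_slots
            pc += 1
        return executed

    program = [("BRD", 6),                                   # branch to index 6 (NEW_ADDR)
               ("INST", 2), ("INST", 3), ("INST", 4),        # three delay-slot instructions
               ("INST", "skipped"), ("INST", "skipped"),
               ("INST", "NEW_ADDR")]
    for step in run(program):
        print(step)
    # Execution order: BRD, INST 2, INST 3, INST 4, then the instruction at NEW_ADDR.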

Page 16: Pipelining - Yonsei


Multicycle Branch vs. Delayed Branch

Page 17: Pipelining - Yonsei


Interrupt Effects

Interrupts have effects on the pipeline similar to those of branches
  Interrupts typically involve a change in the flow of control to branch to the interrupt service routine
  The pipeline often increases the processor's interrupt response time, much as it slows down branch execution

When an interrupt occurs
  Almost all processors allow instructions at the decode stage or further along in the pipeline to finish executing, because these instructions may already be partially executed

What happens beyond this point
  Varies from processor to processor

Page 18: Pipelining - Yonsei


Example from the TI TMS320C5x

One cycle after the interrupt is recognized, the processor inserts an INTR instruction into the pipeline

INTR is a special branch instruction that causes the processor to begin execution at the appropriate interrupt vector
This causes a four-instruction delay before the first word of the interrupt vector is executed

Page 19: Pipelining - Yonsei


Normal Interrupts of the Motorola DSP5600x

The DSP5600x does not use an INTR instruction
  It simply begins fetching from the vector location after the interrupt is recognized
  At most two words are fetched starting at this address
  For a normal interrupt, one of the two words is a subroutine call
  The processor flushes the previously fetched instruction and then branches to the long interrupt vector

Page 20: Pipelining - Yonsei


Fast Interrupts of the Motorola DSP5600x

The same as normal interrupts, except that
  Neither of the two words starting at the interrupt vector is a subroutine call
  The processor executes the two words and then continues executing the original program

Page 21: Pipelining - Yonsei


Pipeline Programming Models

Two major assembly code formats for pipelined processors (compared in the sketch below)

Time-stationary
  The processor's instructions specify the actions to be performed by the execution units during a single instruction cycle
  Example from the AT&T DSP16xx:

    a0=a0+p p=x*y x=*r0++ y=*pt++

  Each portion of the instruction operates on separate operands
  Related to operand-unrelated parallel moves
  More flexible

Data-stationary
  Specifies the operations that are to be performed, but not the exact times at which the actions are executed
  Example from the AT&T DSP32xx:

    a1 = a1 + (*r5++ = *r4++) * *r3++

  Related to operand-related parallel moves
  Easier to read
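A side-by-side sketch (my own illustration in Python, not processor code) of the same multiply-accumulate expressed in the two models; the arrays x_mem and y_mem are hypothetical:

    x_mem = [1, 2, 3, 4]
    y_mem = [10, 20, 30, 40]

    # Time-stationary view (DSP16xx-style): each "instruction" states what every
    # unit does in ONE cycle; the adder, multiplier, and loads work on data
    # belonging to different loop iterations.
    a0, p, x, y = 0, 0, 0, 0
    for i in range(len(x_mem) + 2):            # +2 cycles to drain the pipeline
        a0 = a0 + p                            # adder: uses last cycle's product
        p = x * y                              # multiplier: uses last cycle's loads
        x = x_mem[i] if i < len(x_mem) else 0  # loads: fetch this cycle's operands
        y = y_mem[i] if i < len(y_mem) else 0
    print("time-stationary result:", a0)

    # Data-stationary view (DSP32xx-style): each "instruction" describes the whole
    # operation on one set of operands; the hardware spreads it out in time.
    a1 = 0
    for xi, yi in zip(x_mem, y_mem):
        a1 = a1 + xi * yi
    print("data-stationary result:", a1)       # both print 300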

Page 22: Pipelining - Yonsei


Pipelining

Two Basic Control Schemes for Pipelined Data Paths

Data-stationary
  Passes a control function code along with the data
  Allows simple and straightforward design of both the state sequencer and the data path control circuits for each stage
  Requires more layout area

Time-stationary
  Provides the control signals for the entire pipeline from a single source external to the pipeline
  The central controller governs the entire state of the machine at each time unit
  More complex design
    Must remember the current pipe state and provide appropriate control signals for each pipe stage (see the sketch below)
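A sketch (my own illustration) of the two control schemes on a toy three-stage datapath (negate, scale, offset); the stage operations and the "x2"/"x3" control codes are invented for the example:

    # Data-stationary control: the control code rides through the pipeline
    # registers together with its data, so each stage only looks at the code
    # attached to the item it currently holds.
    def data_stationary(stream):
        r1 = r2 = r3 = None                    # pipeline registers hold (value, ctrl) or None
        out = []
        for item in list(stream) + [None] * 3: # feed inputs, then drain
            if r3 is not None:
                out.append(r3[0])
            r3 = (r2[0] + 1, r2[1]) if r2 else None                              # stage 3: offset
            r2 = (r1[0] * (2 if r1[1] == "x2" else 3), r1[1]) if r1 else None    # stage 2: scale per the carried ctrl
            r1 = (-item[0], item[1]) if item else None                           # stage 1: negate; ctrl enters with the data
        return out

    # Time-stationary control: a central controller drives every stage each
    # cycle, so it must track which item sits in which stage.
    def time_stationary(stream):
        ctrl_queue = [ctrl for _, ctrl in stream]       # the controller's schedule
        r1 = r2 = r3 = None
        out = []
        for cycle, item in enumerate(list(stream) + [None] * 3):
            if r3 is not None:
                out.append(r3)
            # The item in stage 2 this cycle was issued two cycles ago.
            stage2_ctrl = ctrl_queue[cycle - 1] if 1 <= cycle <= len(ctrl_queue) else None
            r3 = r2 + 1 if r2 is not None else None
            r2 = r1 * (2 if stage2_ctrl == "x2" else 3) if r1 is not None else None
            r1 = -item[0] if item else None
        return out

    print(data_stationary([(5, "x2"), (7, "x3")]))   # [-9, -20]
    print(time_stationary([(5, "x2"), (7, "x3")]))   # [-9, -20]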

Page 23: Pipelining - Yonsei


Pipelining

Two Basic Control Schemes for Pipelined Data Paths

(Figure: Data-stationary and Time-stationary control schemes)