DSP VLSI Design
Pipelining
Byungin Moon
Yonsei University
YONSEI UNIVERSITY, DSP VLSI Design
Pipelining: Outline
- What is pipelining?
- Performance advantage of pipelining
- Pipeline depth
- Interlocking
  - Due to resource contention
  - Due to data dependency
- Branching Effects
- Interrupt Effects
- Pipeline programming models
  - Time-stationary
  - Data-stationary
Pipelining: Definition
- A technique for increasing the performance of a processor (or other electronic system) by breaking a sequence of operations into smaller pieces and executing these pieces in parallel when possible
- Used in almost all current DSP processors
Strength
- Decreases the overall time required to complete the set of operations
Weakness
- Complicates programming
  - Execution time of a specific instruction sequence can vary from case to case
  - Certain instruction sequences must be avoided for correct program operation
- Represents a trade-off between efficiency and ease of use
Pipelining
Illustration of how pipelining increases performance on a hypothetical processor
A hypothetical processor uses separate execution units to accomplish the following actions for a single instruction (each stage takes 20 ns to execute); it resembles the TI TMS320C3x
- Fetch an instruction word from memory
- Decode the instruction
- Read or write a data operand from or to memory
- Execute the ALU or MAC portion of the instruction
Nonpipelined
- The four stages are executed sequentially
- Execution time of 80 ns per instruction
- Each stage is idle 75% of the time
Pipelined
- The four stages of execution are overlapped
- Executes a new instruction every 20 ns
- An instruction appears to the programmer to execute in one instruction cycle
- Instructions appear to execute sequentially
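The speedup above can be checked with a toy timing model of the hypothetical four-stage, 20 ns-per-stage processor (an illustrative sketch, not code for any real device):

```python
# Toy timing model of the hypothetical 4-stage processor above.
STAGE_NS = 20   # each stage takes 20 ns
STAGES = 4      # fetch, decode, operand access, execute

def nonpipelined_ns(n):
    # Every instruction runs all four stages back to back: 80 ns each.
    return n * STAGES * STAGE_NS

def pipelined_ns(n):
    # The first instruction needs the full 80 ns to fill the pipeline;
    # after that, one instruction completes every 20 ns.
    return (STAGES + n - 1) * STAGE_NS

print(nonpipelined_ns(100))  # 8000
print(pipelined_ns(100))     # 2060
```

For long instruction streams the pipelined time approaches one stage delay per instruction, the fourfold gain the slide describes.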
Pipelining
Performance Comparison (Nonpipelined vs. Pipelined)
Pipelining: Pipeline Depth
Number of pipeline stages
- Varies from one processor to another
A deeper pipeline
- Allows the processor to execute faster
- But makes the processor harder to program
Most processors use three or four stages
Three-stage pipeline
- Instruction fetch, decode, and execute
- Operand fetch is typically done in the latter part of the decode stage
Four-stage pipeline
- Instruction fetch, decode, operand fetch, and execute
Others
- Analog Devices processors (two stages) and the TI TMS320C54x (five stages)
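The speed-vs-depth trade-off can be sketched with a simple model: stage delay shrinks as the total logic delay divided by the depth, while every flushed branch wastes roughly one slot per extra stage. The 80 ns logic delay, the branch fraction, and the flush penalty below are assumptions for illustration only, and latch overhead is ignored:

```python
# Illustrative throughput model for pipeline depth (not from any
# datasheet): deeper pipes give a faster clock but pay more per flush.
LOGIC_DELAY_NS = 80.0   # assumed total work per instruction

def throughput(depth, branch_fraction):
    """Instructions per ns when a fraction of instructions are taken
    branches that flush the (depth - 1) slots behind them."""
    cycle_ns = LOGIC_DELAY_NS / depth
    cycles_per_instr = 1 + branch_fraction * (depth - 1)
    return 1.0 / (cycle_ns * cycles_per_instr)

for depth in (1, 3, 4, 8):
    print(depth, throughput(depth, branch_fraction=0.1))
```

With no branches, throughput scales linearly with depth; with frequent flushes the gain tails off, which is one reason most DSPs stop at three or four stages.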
Pipelining: What is Interlocking?
Resource Contention
- Pipelined processors may not perform as well as shown in the hypothetical example, mainly due to resource contention (conflict)
Example
- Suppose it takes two instruction cycles to write to memory (as on AT&T DSP16xx processors)
- Instruction I2 attempts to write to memory and I3 needs to read from memory
- The second cycle of I2's data write phase conflicts with I3's data read
Solution to resource contention: interlocking
Interlocking
- Delays the progression of the latter of the conflicting instructions through the pipeline
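The stall behavior can be sketched with a small scheduler. This is an illustrative model of a single shared memory port where writes occupy the port for two cycles, not an emulation of the DSP16xx itself:

```python
# Sketch: interlock insertion when a 2-cycle memory write collides
# with the next instruction's memory read (assumed single memory port).

def schedule(accesses, write_cycles=2):
    """accesses: one 'read', 'write', or None per instruction, in
    program order. Returns the cycle on which each instruction's
    memory stage starts, stalling on port conflicts."""
    busy_until = -1            # last cycle the memory port is occupied
    start_cycles = []
    cycle = 0
    for action in accesses:
        # Interlock: hold this instruction until the port is free.
        while action is not None and cycle <= busy_until:
            cycle += 1         # one stall (interlock) cycle
        start_cycles.append(cycle)
        if action == 'write':
            busy_until = cycle + write_cycles - 1
        elif action == 'read':
            busy_until = cycle
        cycle += 1
    return start_cycles

# I1 no access, I2 writes (port busy 2 cycles), I3 reads.
print(schedule([None, 'write', 'read']))  # [0, 1, 3] -- I3 stalls once
```

Replacing I3's read with no access yields no stall, matching the point made later that interlocks depend on the surrounding instructions.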
Pipelining
Example of Pipeline Resource Contention and Interlocking to Resolve Resource Contention
Pipelining
Complicated Programming on Interlocked Pipelined Processors
There are a number of interlocking sources
- For example, in processors supporting instructions with long immediate data:
  - Instructions with long immediate data require an additional program memory fetch to get the immediate data
  - This long immediate data fetch conflicts with the fetch of the next instruction, resulting in interlocking
It is not easy to spot interlocks by reading the program code
- Whether an instruction causes an interlock depends on the instructions that surround it
- For example, if instruction I3 in the previous example did not need to read from data memory, there would be no conflict and no interlock would occur
Pipelining: Data Hazard – Another Interlocking Source
Example from the Motorola DSP5600x
- Makes little use of interlocking
- Uses a three-stage pipeline
  - Fetch
  - Decode: addresses used in data accesses are formed
  - Execute: ALU operation, data accesses, register loads
Example code
  MOVE #$1234, R0
  MOVE X:(R0), X0
(R0 contains the hexadecimal value 5678 before execution of the above)
Seemingly
- The above instructions move the value stored at X memory address 1234 into register X0
Actually
- The above instructions move the value stored at X memory address 5678 into register X0
- This is because of a pipeline hazard resulting from a data dependency
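The surprise can be reproduced with a tiny two-instruction model: instruction i+1 forms its memory address in decode during the same cycle in which instruction i executes, so the address is computed from the old register value. The mini instruction set and the memory contents below are made up for illustration:

```python
# Illustrative model of the DSP5600x decode/execute overlap hazard.

def run(program):
    regs = {'R0': 0x5678}                  # R0 holds $5678 beforehand
    mem = {0x1234: 0xAAAA, 0x5678: 0xBBBB}  # made-up X memory contents
    addr = {}
    for t in range(len(program) + 1):
        if t < len(program):               # decode stage of instr t
            op, dst, src = program[t]
            if op == 'load_indirect':
                addr[t] = regs[src]        # address uses pre-execute R0
        if t >= 1:                         # execute stage of instr t-1
            op, dst, src = program[t - 1]
            if op == 'load_imm':
                regs[dst] = src            # register write lands here
            else:
                regs[dst] = mem[addr[t - 1]]
    return regs

# MOVE #$1234, R0   then   MOVE X:(R0), X0
final = run([('load_imm', 'R0', 0x1234),
             ('load_indirect', 'X0', 'R0')])
print(hex(final['X0']))  # 0xbbbb: the value at the OLD address $5678
```

X0 ends up with the word at address $5678, not $1234, exactly as the slide states.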
Pipelining: A Motorola DSP5600x Pipeline Hazard
Pipelining: Data Hazard – Another Interlocking Source
Interlocking to protect the programmer from the hazard
- TI TMS320C3x, TMS320C4x, and TMS320C5x processors
- The TMS320C3x detects writes to any of its address registers and holds up, in the pipeline, any other instruction that uses an address register until the write has completed
Trade-off made by heavily interlocked processors
- Saves the programmer from worrying about whether certain instruction sequences will produce correct output
- Allows the programmer to write slower-than-optimal code, perhaps without even realizing it
Pipelining
Interlocking to Solve the Pipeline Hazard (from the TI TMS320C3x)
- The LDI (load immediate) instruction loads a value into an address register
- MPYF (floating-point multiply) uses register-indirect addressing to fetch one of its operands
Pipelining: Branching Effects
Control dependency from branches
- When a branch instruction reaches the decode stage in the pipeline, the next sequential instruction word has already been fetched and is in the pipeline
- Until the branch is resolved, the processor does not know where the next instruction is located
One solution: the multicycle branch
- Discard, or flush, the unwanted instruction
- Cease fetching new instructions until the branch is resolved
- Results in some wasted cycles
- Some processors use tricks to execute the branch late in the decode phase, saving one instruction cycle
Almost all DSP processors use multicycle branches
Pipelining: Branching Effects
Alternative to the multicycle branch: the delayed branch
- Several instructions following the branch are executed normally

  BRD NEW_ADDR
  INST 2   ; INST 2 to INST 4
  INST 3   ; are executed before
  INST 4   ; the branch occurs

- Instructions that will be executed before the branch takes effect must be located in memory after the branch
- The branch appears to be delayed in its effect by several instruction cycles
- Used by the TMS320C3x, TMS320C4x, TMS320C5x, ADSP-2100x, DSP32C, DSP32xx, and ZR3800x
Trade-offs of multicycle and delayed branches
- Ease of programming vs. efficiency, as with interlocking
- In the worst case, the programmer can always place NOP instructions after a delayed branch
Branch effects occur whenever there is a change in program flow
- Subroutine call instructions, subroutine return instructions, and return-from-interrupt instructions
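The cycle cost of the two branch schemes can be tallied with a toy model. The counts below are illustrative (one cycle for the branch itself plus the wasted slots), not figures from any datasheet:

```python
# Illustrative cycle accounting for a taken branch.

def multicycle_branch_cost(flushed):
    # Wrongly fetched instructions are flushed; each costs a cycle.
    return 1 + flushed

def delayed_branch_cost(slots, useful):
    # All delay slots execute; only the slots the programmer had to
    # fill with NOPs are wasted.
    return 1 + (slots - useful)

print(multicycle_branch_cost(flushed=3))       # 4 cycles
print(delayed_branch_cost(slots=3, useful=3))  # 1 cycle: slots filled
print(delayed_branch_cost(slots=3, useful=0))  # 4 cycles: all NOPs
```

A fully filled delayed branch costs one cycle; an all-NOP delayed branch degenerates to the multicycle cost, which is the worst case the slide mentions.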
Pipelining: Multicycle Branch vs. Delayed Branch
Pipelining: Interrupt Effects
- Interrupts have effects on the pipeline similar to those of branches
  - Interrupts typically involve a change in program flow, branching to the interrupt service routine
  - The pipeline often increases the processor's interrupt response time, much as it slows down branch execution
- When an interrupt occurs, almost all processors allow instructions at the decode stage or further in the pipeline to finish executing, because these instructions may be partially executed
- What occurs past this point varies from processor to processor
Pipelining: Example from the TI TMS320C5x
- One cycle after the interrupt is recognized, the processor inserts an INTR instruction into the pipeline
- INTR is a special branch instruction that causes the processor to begin execution at the appropriate interrupt vector
- This causes a four-instruction delay before the first word of the interrupt vector executes
Pipelining: Normal Interrupts of the Motorola DSP5600x
- The DSP5600x does not use an INTR instruction
- It simply begins fetching from the vector location after the interrupt is recognized
- At most two words are fetched starting at this address
- If one of the two words is a subroutine call, the processor flushes the previously fetched instruction and then branches to the long interrupt vector
Pipelining: Fast Interrupts of the Motorola DSP5600x
- The same as normal interrupts, except that neither of the two words starting at the interrupt vector is a subroutine call
- The processor executes the two words and then continues executing from the original program
Pipelining: Pipeline Programming Models
Two major assembly code formats for pipelined processors
Time-stationary
- The processor's instructions specify the actions to be performed by the execution units during a single instruction cycle
- Example from the AT&T DSP16xx:
  a0=a0+p p=x*y x=*r0++ y=*pt++
- Each portion of the instruction operates on separate operands
- Related to operand-unrelated parallel moves
- More flexible
Data-stationary
- Specifies the operations that are to be performed, but not the exact timing of when the actions are executed
- Example from the AT&T DSP32xx:
  a1 = a1 + (*r5++ = *r4++) * *r3++
- Related to operand-related parallel moves
- Easier to read
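The single-cycle semantics of the time-stationary DSP16xx-style line can be modeled explicitly: all four actions in a0=a0+p p=x*y x=*r0++ y=*pt++ happen in the same cycle, so every right-hand side sees the register values from before the cycle. The register names follow the example above; the data values and the drain-with-zeros trick are made up for illustration:

```python
# Illustrative single-cycle model of a time-stationary MAC instruction.

def step(s, xmem, ymem):
    old = dict(s)                    # every read sees pre-cycle state
    s['a0'] = old['a0'] + old['p']   # accumulate the previous product
    s['p'] = old['x'] * old['y']     # multiply previously fetched data
    s['x'] = xmem[old['r0']]         # fetch the next operands...
    s['y'] = ymem[old['pt']]
    s['r0'] = old['r0'] + 1          # ...with post-increment addressing
    s['pt'] = old['pt'] + 1
    return s

state = {'a0': 0, 'p': 0, 'x': 0, 'y': 0, 'r0': 0, 'pt': 0}
xs = [1, 2, 3, 4, 0, 0]              # zero padding drains the pipeline
ys = [5, 6, 7, 8, 0, 0]
for _ in range(6):
    state = step(state, xs, ys)
print(state['a0'])  # 70 = 1*5 + 2*6 + 3*7 + 4*8 (a dot product)
```

Note the two extra iterations at the end: because each instruction mixes stages of three different data items, the accumulator lags the fetches by two cycles, which is exactly the pipeline-exposing flexibility (and burden) of the time-stationary model.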
Pipelining
Two Basic Control Schemes for Pipelined Data Paths
Data-stationary
- Passes a control function code along with the data
- Allows simple and straightforward design of both the state sequencer and the data path control circuits for each stage
- Requires more layout area
Time-stationary
- Provides the control signals for the entire pipeline from a single source external to the pipeline
- The central controller governs the entire state of the machine at each time unit
- More complex design
  - Must remember the current pipe state and provide appropriate control signals for each pipe stage
Pipelining
Two Basic Control Schemes for Pipelined Data Paths
Data-stationary vs. Time-stationary