7/22/2019 DSP Processor Fundamentals
Slide: 1
DSP Processor Fundamentals
Subhasish Mukherjee
Slide: 2
Salient Features of DSP Processors
Fast multiply and accumulate
Multiple access memory architecture
Specialized addressing modes
Specialized execution control
Peripherals and I/O interfaces
Slide: 3
DSP Processor Embodiments
Multichip modules
  Multiple dies in a single package
  Increased operating speed & reduced power dissipation
Multiple processors on chip
Chip sets
  Dividing the processor into two or more packages
  Makes sense when the processor is very complex & has a large number of I/O pins
  Saves cost
DSP cores
Slide: 4
Fixed-Point vs. Floating Point
Most DSPs are fixed-point
  Fixed-point DSPs support integer and fractional arithmetic
  Limited dynamic range and precision
  Cheaper too
  Mostly use a 16-bit format, though some use 20- or 24-bit formats
Floating-point DSPs use a mantissa and exponent representation
  They provide good dynamic range and precision
  Mostly use a 32-bit format
  Easier to program
Slide: 5
Fixed Point Data Path
Slide: 6
Content of Fixed Point Data Path
Typically incorporates a multiplier, an ALU, shifters, operand registers & accumulators.
Single-cycle multipliers are central to programmable DSPs
  Often integrated with an adder to make a multiply-accumulate (MAC) unit.
Slide: 7
Accumulator
Holds intermediate and final results of MAC operations
Most DSP processors provide multiple accumulators
Accumulators have guard bits so that a number of values can be accumulated without overflow
Guard bits provide greater flexibility than scaling.
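The effect of guard bits can be illustrated with a small simulation. This is only a sketch under assumptions that are not on the slide: 16-bit operands, a 32-bit product, and a 40-bit accumulator (8 guard bits). The helper names `wrap_signed` and `mac` are hypothetical.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a two's-complement value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(samples, coeffs, acc_bits):
    """Multiply-accumulate into an accumulator that is acc_bits wide."""
    acc = 0
    for x, h in zip(samples, coeffs):
        acc = wrap_signed(acc + x * h, acc_bits)
    return acc


x = [0x7FFF] * 8  # eight near-full-scale 16-bit samples (assumed test data)
h = [0x7FFF] * 8
exact = sum(a * b for a, b in zip(x, h))

print(mac(x, h, 32) == exact)  # False: the sum overflows a 32-bit accumulator
print(mac(x, h, 40) == exact)  # True: 8 guard bits absorb the growth
```

With 8 guard bits, up to 256 full-scale products can be summed before the accumulator can overflow, without scaling the inputs down first.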
Slide: 8
ALU
Implements basic arithmetic and logical operations in a single instruction cycle.
Common operations include add, subtract, increment, negate, logical and, or, not.
Processors differ in the word size used for logical operations.
Slide: 9
Shifter
Used for scaling the input by a power of 2
Either eliminates the possibility of overflow or reduces it to an acceptably low level.
Trade-off is loss of precision and dynamic range.
Barrel shifters offer more flexibility
Slide: 10
Memory Architecture & Addressing Schemes
Slide: 11
Motivation
An FIR filter involves the following operations
  Fetch the MAC instruction
  Fetch coefficient h(m)
  Fetch delayed input x(n-m)
  Multiply both
  Add to the previous result
  Shift data in the delay line
The above set of operations is done for all the taps of the filter for each sample
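[Figure: tapped delay line FIR filter. Input x(n) passes through unit delays z^-1; the taps are weighted by coefficients h0, h1, h2, ..., hN-1, hN and summed to give output y(n).]
$$y(n) = \sum_{m=0}^{N} h(m)\, x(n-m)$$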
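The per-tap operations listed on this slide map directly onto a sum of products. The following is a minimal sketch of the FIR equation, assuming the input is zero before the first sample; `fir_output` and the test data are hypothetical.

```python
def fir_output(h, x, n):
    """y(n) = sum over m of h[m] * x[n - m], with x taken as zero before index 0."""
    return sum(h[m] * x[n - m] for m in range(len(h)) if n - m >= 0)


h = [1, 2, 3]     # filter coefficients h(0), h(1), h(2)
x = [1, 0, 0, 0]  # unit impulse input
y = [fir_output(h, x, n) for n in range(len(x))]
print(y)  # [1, 2, 3, 0]
```

Feeding in a unit impulse returns the coefficients themselves, which is a quick sanity check on any FIR implementation.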
Slide: 12
Motivation
Conventional processors need more than 5 cycles/tap/sample to implement the above FIR filter
DSP architectures try to reduce the cycles needed to compute this primitive
This is accomplished by
  Harvard architecture
  Efficient addressing modes
Slide: 13
Harvard Architecture
Basic Harvard Architecture
  Separate program and data buses
  Different from the von Neumann architecture
Modification 1
  Data fetches possible from program memory
  Opcode and one data fetch done in parallel
[Figure: Basic Harvard Architecture. Program Memory on P BUS; Data Memory on D BUS.]
[Figure: Harvard Architecture Modification #1. Program/Data Memory on P BUS; Data Memory on D BUS.]
Slide: 14
Harvard Architecture
Modification 2
  One program memory
  One dual-ported data memory
  3 buses for the internal memory: 2 for data, 1 for program
  2 data words can be fetched in parallel with an instruction
[Figure: Harvard Architecture Modification 2. Program Memory on P BUS; Multi-Port Data Memory on D BUS 1 and D BUS 2.]
Slide: 15
Harvard Architecture
Modification 3
  One program memory & program cache
  Two data memories
  3 buses for the internal memory: 2 for data & 1 for program
  2 data words can be fetched in parallel with an instruction
[Figure: Harvard Architecture Modification 3. Program Memory and Program Cache on P BUS; Data Memory 1 on D BUS 1; Data Memory 2 on D BUS 2.]
Slide: 16
Addressing Mode: Circular Addressing
Avoids shifting of data in the delay line
Oldest element is overwritten by the new element
Pointer wraps around once it crosses the start or the end of the circular buffer
Need to maintain 5 parameters for circular buffer operation
[Figure: Circular buffer example. Eight locations hold X(n) through X(n-7). X(n) is the most recent sample at time instant n and becomes the 2nd most recent sample at instant n+1. X(n-7) is the oldest sample at time instant n and will be overwritten by the most recent sample at instant n+1. A second panel labels a buffer from Start to End containing X(n-m), X(n), X(n-m-1) and X(n-N).]
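Circular addressing can be sketched in software with modulo arithmetic on the pointer. The class below is an illustrative model only (the names `CircularDelayLine`, `push`, `delayed` and `fir_step` are hypothetical); a DSP does the wraparound in its address generation hardware rather than with a `%` operation.

```python
class CircularDelayLine:
    """Delay line in which the newest sample overwrites the oldest one."""

    def __init__(self, length):
        self.buf = [0] * length
        self.newest = 0  # index of the most recent sample

    def push(self, sample):
        # Advance the pointer with wraparound; the slot it lands on
        # holds the oldest sample, which is overwritten.
        self.newest = (self.newest + 1) % len(self.buf)
        self.buf[self.newest] = sample

    def delayed(self, m):
        """Return x(n - m) without moving any data."""
        return self.buf[(self.newest - m) % len(self.buf)]


def fir_step(line, h, sample):
    """Accept one input sample and return one FIR output sample."""
    line.push(sample)
    return sum(h[m] * line.delayed(m) for m in range(len(h)))


line = CircularDelayLine(3)
h = [1, 2, 3]
print([fir_step(line, h, s) for s in [1, 0, 0, 0]])  # [1, 2, 3, 0]
```

No sample is ever copied from one buffer slot to another; only the pointer moves, which is the saving the slide refers to.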
Slide: 17
Multiple Access Memories
Supports multiple, sequential accesses per instruction cycle
Can be combined with Harvard architecture for better performance
Supporting off-chip memory means introducing significant additional delay between processor core and memory
Slide: 18
Multiported Memories
Has multiple independent sets of address and data connections
Can provide multiple simultaneous accesses
Costly
Supporting off-chip memory means a larger and more expensive package
Slide: 19
Program Cache
Simplest type is a single-instruction repeat buffer
Can be extended to a multi-word repeat buffer
Another type is a single-sector instruction cache
Extended to a multiple independent sector cache
Used only for program instructions and not for data
Slide: 20
Wait States
State in which the processor waits to access memory
Conflict wait states
  Multiple accesses to a memory that cannot handle multiple accesses
Externally requested wait states
  Multiple processors sharing a data bus
  TMS320C5x has a special READY pin which can be used by external hardware to signal the processor that it must wait before accessing external memory.
Slide: 21
Multiprocessor Support - Memory Interface
Multiple external memory ports
Sometimes multiple processors share one external memory bus
  Bus arbitration required
  Two pins can be configured to act as bus request and bus grant signals
TMS320C5x allows external access to on-chip memory through BR and IAQ signals
  Helpful for multiprocessor communication without shared memory
Slide: 22
Direct Memory Access
Handled by a DMA controller
Coupled with the Bus Request and Bus Grant pins of the processor
Some sophisticated DMA controllers reside on-chip and access on-chip memory
Multiple-channel DMA controllers handle multiple memory transfers in parallel
Slide: 23
Memory Addressing Schemes
Implied addressing
  Operand addresses are implied
  P = X * Y
Immediate data
  Operand itself is encoded in the instruction
  AX0 = 1234
Memory direct addressing
  The address of the data in memory is encoded in the instruction word
  AX0 = DM(1000)
Slide: 24
Memory Addressing Schemes
Register direct addressing
  Data being addressed resides in a register
  SUBF R1, R2
Register indirect addressing
  Data resides in memory and the address resides in the register
  A0 = A0 + *R5
[Figure: an address register holding 0x1000 points to memory location 0x1000, which contains the value 7.]
Slide: 25
Memory Addressing Schemes
Register indirect addressing with pre- and post-increment
  A0 = A0 + *R5++ (Post Increment)
  A0 = A0 + *R5++R17 (Post Increment)
    Address incremented by the value stored in register R17
  MOVE X: -(R0), A1 (Pre Decrement)
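The post-increment forms above can be modelled with a toy register file and memory. This is a sketch for illustration only: the memory contents, the initial register values and the helper `add_indirect_post_inc` are all assumptions, not taken from any particular processor.

```python
# Hypothetical memory contents and register values for the illustration.
memory = {0x1000: 7, 0x1001: 9, 0x1002: 11, 0x1003: 13}
regs = {"A0": 0, "R5": 0x1000, "R17": 2}


def add_indirect_post_inc(regs, memory, acc, ptr, step=1):
    """acc = acc + *ptr, then ptr = ptr + step (post-increment)."""
    regs[acc] += memory[regs[ptr]]  # operand is read through the old address
    regs[ptr] += step               # address register is updated afterwards


add_indirect_post_inc(regs, memory, "A0", "R5")               # A0 = A0 + *R5++
add_indirect_post_inc(regs, memory, "A0", "R5", regs["R17"])  # A0 = A0 + *R5++R17
print(regs["A0"], hex(regs["R5"]))  # 16 0x1003
```

In both cases the operand is fetched using the address held before the update, which is what distinguishes post-increment from the pre-decrement form.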
Slide: 26
Memory Addressing Schemes
Register indirect addressing with indexing
  Values stored in two address registers are added to form an effective address
  Does not change the content of any of the address registers
  MOVE Y1, X: (R6 + N6)
  LDI *-AR1(1), R7
Slide: 27
Memory Addressing Schemes
Register addressing with bit reversal
  Used for FFT
  The output or input is in a scrambled order
  000 = 0
  100 = 4
  010 = 2
  110 = 6
  001 = 1
  101 = 5
  011 = 3
  111 = 7
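The scrambled order listed above is what results from reversing the bits of a 3-bit counter. A minimal software sketch follows, with `bit_reverse` as a hypothetical helper; a DSP produces the same sequence in its address generator at no cycle cost.

```python
def bit_reverse(index, bits):
    """Return index with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the LSB of index into result
        index >>= 1
    return result


order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```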
Slide: 28
Instruction Set
Slide: 29
Instruction Types
Arithmetic & multiplication
Logic operations
Shifting
Rotation
Comparison
Looping
Branching, subroutine calls and returns
Conditional instructions
Special function instructions
  Block floating point instructions, stack operations etc.
Bit manipulation instructions
Slide: 30
Registers
Accumulators
General & special purpose registers
Address registers
Other registers
  Stack pointer
  Program counter
  Loop registers
Slide: 31
Parallel Move Support
Operand-related parallel moves
  MPY (R0), (R4)
  Accesses are limited to those required by the arithmetic operation
Operand-unrelated parallel moves
  MPY X0, Y0, A X: (R0)+, X0 Y1, Y: (R4)+
  Memory accesses unrelated to the operands of the ALU operation
Slide: 32
Orthogonality
Indicates the extent to which a processor's instruction set is consistent
Depends upon
  Consistency & completeness of the instruction set
  Degree to which operands and addressing modes are uniformly available with different operations
Slide: 33
Assembly Language Format
Traditional opcode-operand variety
  MPY X0, Y0
  ADD P,A
  MOV (R0), X0
  JMP LOOP
C-like syntax
  P = X0 * Y0
  A = P + A
  X0 = *R0
  GOTO LOOP
Slide: 34
Execution Control
Slide: 35
Looping
Hardware looping
  RPT #16
  MAC (R0)+, (R4)+, A
Software looping
  MOVE #16, B
  LOOP: MAC (R0)+, (R4)+, A
  DEC B
  JNE LOOP
Slide: 36
Considerations in Looping
Sometimes a loop repetition count of 0 causes the processor to repeat the loop the maximum number of times
Consider loop effects on interrupt latency
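One way a zero count can turn into the maximum count is a counter that is decremented after the loop body and only then tested for zero. The model below is purely hypothetical (a 16-bit counter and the function `repeat_iterations` are assumptions); real processors differ in how their loop hardware treats a zero count.

```python
def repeat_iterations(count, counter_bits=16):
    """Count loop passes for a counter that decrements after each pass
    and stops when it reaches zero (assumed behaviour, for illustration)."""
    counter = count
    iterations = 0
    while True:
        iterations += 1                                # loop body executes
        counter = (counter - 1) % (1 << counter_bits)  # decrement with wraparound
        if counter == 0:
            return iterations


print(repeat_iterations(16))  # 16
print(repeat_iterations(0))   # 65536
```

Under this model a count of 0 wraps to the largest counter value, so code that computes loop counts at run time may need an explicit check for zero.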
Slide: 37
Nesting
Directly nestable
  Hardware loop instruction placed within the outer loop
Partially nestable
  Single-instruction loop inside a multi-instruction loop
Software nestable
  Multi-instruction hardware loops are nested by saving various registers like loop index, loop start & loop count
Slide: 38
Interrupts
Interrupt sources
  On-chip peripherals, external interrupt lines and software interrupts
Interrupt vectors
  Associate each interrupt with a different memory address
  Typically one or two words long and located in low memory
  Usually contain a branch or subroutine call to an interrupt handler routine
Slide: 39
Interrupt latency
Time between the assertion of an external interrupt line and the execution of the first word of the interrupt vector
The following add up to the interrupt latency
  Interrupt line must be asserted prior to the start of the instruction cycle in which the interrupt is said to have occurred (set-up time)
  Time to pass through synchronization stages
  Wait until the processor reaches an interruptible state
  Wait until all instructions in the pipeline are finished
  If the interrupt vector holds only the address of the interrupt routine, the time required to branch to that location
Slide: 40
Stacks
Typically one of three kinds of stack support is provided
  Shadow registers
  Hardware stack
  Software stack
Slide: 41
Pipelining
Slide: 42
Pipelining and Performance
Technique for increasing the performance of a processor
  Breaks a sequence of operations into smaller pieces
  Executes the pieces in parallel whenever possible
Hypothetical processor
  Fetch an instruction word from memory
  Decode the instruction
  Read/write data operands from/to memory
  Execute the ALU or MAC operation of the instruction
Slide: 43
Pipelining and Performance
Clock Cycle         1     2     3     4     5     6     7
Instruction Fetch   I1    I2    I3    I4    I5    I6    I7
Decode                    I1    I2    I3    I4    I5    I6
Data Read/Write                 I1    I2    I3    I4    I5
Execute                               I1    I2    I3    I4
(The four stages make up the pipeline depth.)
Perfect overlap
100% utilization of processor execution stages
Ideal scenario
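The ideal overlap in the table can be reproduced with a few lines of code. This sketch assumes every stage takes exactly one clock cycle and nothing ever stalls; `pipeline_cycles` and `stage_occupancy` are hypothetical helper names.

```python
def pipeline_cycles(num_instructions, depth):
    """Clock cycles needed to complete all instructions in an ideal pipeline."""
    return depth + num_instructions - 1


def stage_occupancy(num_instructions, depth):
    """For each stage, list the instruction number held in each clock cycle
    (None where the stage is idle)."""
    total = pipeline_cycles(num_instructions, depth)
    table = []
    for stage in range(depth):
        row = []
        for cycle in range(total):
            instr = cycle - stage + 1  # instruction k enters stage s in cycle k + s
            row.append(instr if 1 <= instr <= num_instructions else None)
        table.append(row)
    return table


print(pipeline_cycles(7, 4))       # 10
print(stage_occupancy(4, 4)[3])    # [None, None, None, 1, 2, 3, 4]
```

Without pipelining, the same 7 instructions would need 7 x 4 = 28 cycles on this hypothetical 4-stage processor, against 10 with perfect overlap.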
Slide: 44
Conflicting Instruction
Clock Cycle         1     2     3     4     5     6     7
Instruction Fetch   I1    I2    I3    I4    I5    I6    I7
Decode                    I1    I2    I3    I4    I5    I6
Data Read/Write                 I1    I2    I2/I3 I4    I5
Execute                               I1    I2    I3    I4
(The four stages make up the pipeline depth.)
I2 tries to write to memory while I3 tries to read memory
Solution to this problem is interlocking
Interlocking is delaying the conflicting instruction in the pipeline
Slide: 45
Interlocking
Clock Cycle         1     2     3     4     5     6     7
Instruction Fetch   I1    I2    I3    I4    I4    I5    I6
Decode                    I1    I2    I3    I3    I4    I5
Data Read/Write                 I1    I2    I2    I3    I4
Execute                               I1    I2    NOP   I3
(The four stages make up the pipeline depth.)
Interlocking resolves the resource conflict
Pipeline sequencer holds instruction I3 at the decode stage
I4 is held at the fetch stage
One instruction cycle penalty occurs
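The one-cycle penalty shows up as a bubble in the execute stage. The following sketch models only that last stage, under the assumption that each interlocked instruction is held for exactly one cycle; `execute_stream` is a hypothetical helper.

```python
def execute_stream(instructions, stalled):
    """Sequence seen by the execute stage when every instruction in
    `stalled` is held back one cycle, leaving a NOP bubble ahead of it."""
    stream = []
    for instr in instructions:
        if instr in stalled:
            stream.append("NOP")  # bubble created by the interlock
        stream.append(instr)
    return stream


print(execute_stream(["I1", "I2", "I3"], {"I3"}))  # ['I1', 'I2', 'NOP', 'I3']
```

Each stalled instruction lengthens the stream by one slot, so the total execution time grows by one cycle per interlock.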
Slide: 46
Multicycle Branching Effects
Clock Cycle         1     2     3     4     5     6     7     8
Instruction Fetch   BR    I2    ---   ---   I4    I5    I6    I7
Decode                    BR    ---   ---   ---   I4    I5    I6
Data Read/Write                 BR    ---   ---   ---   I4    I5
Execute                               BR    NOP   NOP   NOP   I4
When a branch instruction reaches the decode stage, one instruction has already been fetched and has to be flushed from the pipeline
NOPs are executed for the invalidated pipeline slots
Multicycle branch typically executes for as many cycles as the pipeline depth
Slide: 47
Delayed Branching Effects
Clock Cycle         1     2     3     4     5     6     7     8
Instruction Fetch   BR    N2    N3    N4    I4    I5    I6    I7
Decode                    BR    N2    N3    N4    I4    I5    I6
Data Read/Write                 BR    N2    N3    N4    I4    I5
Execute                               BR    N2    N3    N4    I4
An alternative to the multicycle branch; does not flush the pipeline
Instructions to be executed before the branch takes effect must be located immediately after the branch instruction in memory
Increases efficiency, but the code is confusing on casual inspection
Slide: 48
Interrupt Effects
Clock Cycle         3     4     5     6     7     8     9     10
Instruction Fetch   I6    ---   ---   ---   V1    V2    V3    V4
Decode              I5    INTR  ---   ---   ---   V1    V2    V3
Data Read/Write     I4    I5    INTR  ---   ---   ---   V1    V2
Execute             I3    I4    I5    INTR  NOP   NOP   NOP   V1
Processor inserts the INTR instruction in the pipeline
INTR is a special branch instruction that flushes the pipeline and jumps to the appropriate interrupt vector location
Causes a 4-cycle delay before the first word of the interrupt vector is executed
I6 is flushed but would be refetched on returning from the interrupt
Slide: 49
Fast Interrupt Processing
Clock Cycle         1     2     3     4     5     6     7     8
Instruction Fetch   I3    I4    V1    V2    I5    I6    I7    I8
Decode              I2    I3    I4    V1    V2    I5    I6    I7
Execute             I1    I2    I3    I4    V1    V2    I5    I6
Interrupt handler is stored at the interrupt vector location
In this case V1 & V2 are the two instructions in the interrupt vector
This is called a fast interrupt, as it does not insert any delay in the pipeline
Slide: 50
Peripherals
Slide: 51
Serial Ports
Serial interface transmits and receives data one bit at a time
Requires far fewer interface pins than a parallel interface
Used for a variety of applications
  Sending/receiving data to/from A/D and D/A converters
  Sending/receiving data to/from other processors or DSPs
  Communicating with other external peripherals
Slide: 52
Serial Ports
Synchronous
  Transmits a bit clock signal in addition to the serial data bits
  Receiver uses that clock for sampling the received data
Asynchronous
  Does not transmit a separate clock signal
  Receiver deduces the clock signal from the serial data itself
  More complex
Slide: 53
Data and Clock
[Figure: timing diagram showing the BIT CLOCK, FRAME SYNC and DATA signals, with the data bits numbered 0, 1, ...]
Most DSPs allow changing the clock polarity, data polarity and shift direction
Frame sync signal indicates the position of the first bit of a data word on the serial data line
Common frame sync formats are bit length and word length
Also can have multiple words per frame
Slide: 54
Serial Clock Generation
Provides circuitry for clock generation
Usually called serial clock generation support
Normally done by scaling the master clock in the DSP
Usually contains a pre-scaler and a down counter
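The scaling of the master clock can be written as a simple division. This is a sketch under an assumed model in which the master clock is divided first by the prescale factor and then by the down-counter reload value; actual devices may divide by the register value plus one or add further fixed dividers, so the data sheet is the authority. The function name and the numbers are hypothetical.

```python
def serial_clock_hz(master_hz, prescale, count):
    """Serial bit clock for the assumed divide chain:
    master clock -> divide by prescale -> divide by count."""
    return master_hz / (prescale * count)


print(serial_clock_hz(40_000_000, 4, 5))  # 2000000.0
```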
Slide: 55
Time Division Multiplex
[Figure: four DSPs, each with CLOCK, FRAME SYNC and DATA pins, connected to shared CLOCK, FRAME SYNC and DATA lines.]
One processor (or external circuitry) generates the clock and frame sync signals
Frame sync indicates the start of a new set of time slots
Transmitted data word might contain some number of bits to indicate the destination DSP. Other bits are used for data
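The idea of reserving some bits of each word for the destination can be sketched as follows. The field layout is an assumption made for illustration (16-bit words, with the top 2 bits addressing one of four DSPs); the slide does not specify any widths, and `pack_word` and `unpack_word` are hypothetical helpers.

```python
ADDR_BITS = 2   # assumed: enough to address four DSPs
WORD_BITS = 16  # assumed serial word length


def pack_word(dest, payload):
    """Place the destination in the top bits and the data in the remaining bits."""
    data_bits = WORD_BITS - ADDR_BITS
    assert 0 <= dest < (1 << ADDR_BITS) and 0 <= payload < (1 << data_bits)
    return (dest << data_bits) | payload


def unpack_word(word):
    """Split a received word into (destination, payload)."""
    data_bits = WORD_BITS - ADDR_BITS
    return word >> data_bits, word & ((1 << data_bits) - 1)


word = pack_word(2, 0x1234)
print(hex(word))          # 0x9234
print(unpack_word(word))  # (2, 4660)
```

Each receiver would compare the destination field with its own identifier and ignore words meant for other DSPs.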
Slide: 56
Timers
Programmable timers are often a source of periodic interrupts
May also be used as a software-controlled square wave generator
[Figure: timer block diagram. A clock source feeds a prescaler and a counter, each loaded from its own preload value.]
Slide: 57
Parallel Ports
Transmit/receive multiple data bits at a time
Faster than serial ports but require more pins
External data bus may be used as a parallel port
Can also have separate parallel ports
Bit I/O ports
  Individual pins can be made input or output on a bit-by-bit basis
Host ports
  Specialized 8/16-bit bidirectional parallel ports used for data transfer between DSP and host microprocessor
  May be used to control the DSP
Communication ports
  Special parallel port intended for multiprocessor communication