DSP Processor Fundamentals

7/22/2019 DSP Processor Fundamentals


Slide: 1

DSP Processor Fundamentals

Subhasish Mukherjee



Slide: 2

Salient Features of DSP Processors

Fast multiply and accumulate

Multiple access memory architecture

Specialized addressing modes

Specialized execution control

Peripherals and I/O interfaces



Slide: 3

DSP Processor Embodiments

Multichip modules: multiple dies in a single package

Increased operating speed and reduced power dissipation

Multiple processors on a chip

Chip sets: the processor is divided into two or more packages

Makes sense when the processor is very complex and has a large number of I/O pins

Saves cost

DSP cores



Slide: 4

Fixed-Point vs. Floating Point

Most DSPs are fixed-point

Fixed-point DSPs support integer and fractional arithmetic

Limited dynamic range and precision

Cheaper too

Mostly use a 16-bit format, though some use a 20- or 24-bit format

Floating-point DSPs use a mantissa and exponent representation

They provide good dynamic range and precision

Mostly use a 32-bit format

Easier to program
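To make the limited range and precision of 16-bit fractional arithmetic concrete, here is a minimal sketch. It assumes the common Q15 convention (1 sign bit, 15 fraction bits), which the slides do not name; the helper names `to_q15`, `from_q15` and `q15_mul` are hypothetical.

```python
# Sketch of 16-bit fractional (Q15) fixed-point arithmetic.
# Assumption: Q15 format; the slides only say "16-bit fraction arithmetic".
Q15_ONE = 1 << 15        # 1.0 would be 32768, one step beyond the largest value
Q15_MAX = Q15_ONE - 1    # largest representable value, just under 1.0
Q15_MIN = -Q15_ONE       # smallest representable value, exactly -1.0


def saturate16(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x):
    """Convert a real number to Q15, saturating outside [-1, 1)."""
    return saturate16(int(round(x * Q15_ONE)))


def from_q15(value):
    """Convert a Q15 integer back to a real number."""
    return value / Q15_ONE


def q15_mul(a, b):
    """Multiply two Q15 values: 16x16 product, then shift back to Q15."""
    return saturate16((a * b) >> 15)


half = to_q15(0.5)
quarter = q15_mul(half, half)
print(from_q15(quarter))       # 0.25
print(from_q15(to_q15(1.5)))   # saturates just below 1.0
```

The second print shows the limited dynamic range: 1.5 cannot be represented and is clamped, whereas a floating-point format would simply adjust its exponent.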



Slide: 5

Fixed Point Data Path



Slide: 6

Content of Fixed Point Data Path

Typically incorporates a multiplier, an ALU, shifters, operand registers and accumulators

Single-cycle multipliers are central to programmable DSPs

Often integrated with an adder to make a multiply-accumulate unit



Slide: 7

Accumulator

Holds intermediate and final results of the MAC operation

Most DSP processors provide multiple accumulators

Accumulators have guard bits so that a number of values can be accumulated without overflow

Guard bits provide greater flexibility than scaling
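The effect of guard bits can be sketched by accumulating the same products in registers of different widths. The 32-bit and 40-bit widths (8 guard bits) are illustrative assumptions, and `wrap_signed` and `mac` are hypothetical helpers that model a two's-complement register which wraps on overflow.

```python
def wrap_signed(value, bits):
    """Wrap an integer into a two's-complement register of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def mac(samples, coeffs, acc_bits):
    """Multiply-accumulate into an accumulator that is acc_bits wide."""
    acc = 0
    for x, h in zip(samples, coeffs):
        acc = wrap_signed(acc + x * h, acc_bits)
    return acc


# Four products of the largest positive 16-bit value.
x = [0x7FFF] * 4
h = [0x7FFF] * 4
exact = sum(a * b for a, b in zip(x, h))

no_guard = mac(x, h, 32)     # product width only: overflows and wraps
with_guard = mac(x, h, 40)   # 8 guard bits: holds the full sum

print(exact, no_guard, with_guard)
```

With 8 guard bits, up to 256 full-scale products can be summed without scaling the inputs first, which is the flexibility the slide refers to.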



Slide: 8

ALU

Implements basic arithmetic and logical operations in a single instruction cycle

Common operations include add, subtract, increment, negate, and logical AND, OR and NOT

Processors differ in the word size used for logical operations



Slide: 9

Shifter

Used for scaling the input by a power of 2

Either eliminates the possibility of overflow or reduces it to an acceptably low level

The trade-off is a loss of precision and dynamic range

Barrel shifters offer more flexibility



Slide: 10

Memory Architecture & Addressing Schemes



Slide: 11

Motivation

An FIR filter involves the following operations:

Fetch the MAC instruction

Fetch coefficient h(m)

Fetch delayed input x(n-m)

Multiply both

Add with the previous result

Shift data in the delay line

The above set of operations is done for all the taps of the filter, for each sample

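[Figure: FIR filter block diagram. The input x(n) feeds a chain of unit delays (z^-1); the delayed samples are weighted by the coefficients h0, h1, h2, ..., hN-1, hN and summed to form the output y(n).]

$$y(n) = \sum_{m=0}^{N} h(m)\, x(n-m)$$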
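The per-tap operations listed on this slide can be sketched in software as follows. This is an illustrative direct-form implementation with an explicitly shifted delay line, not code from the slides; `fir_sample` and `fir_filter` are hypothetical names.

```python
def fir_sample(h, delay_line, x_new):
    """Compute one output sample y(n) of an FIR filter.

    delay_line[m] holds x(n-m) after the new sample has been shifted in.
    """
    # Shift data in the delay line: the oldest sample drops off the end.
    delay_line.insert(0, x_new)
    delay_line.pop()

    acc = 0
    for m in range(len(h)):
        # Fetch coefficient h(m), fetch x(n-m), multiply, accumulate.
        acc += h[m] * delay_line[m]
    return acc


def fir_filter(h, x):
    """Filter the whole input sequence x with coefficients h."""
    delay_line = [0] * len(h)
    return [fir_sample(h, delay_line, sample) for sample in x]


# Feeding a unit impulse returns the coefficients themselves.
print(fir_filter([1, 2, 3], [1, 0, 0, 0]))   # [1, 2, 3, 0]
```

Every tap costs a coefficient fetch, a data fetch, a multiply, an add and a share of the delay-line shift, which is why a processor that does these one at a time needs several cycles per tap.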



Slide: 12

Motivation

Conventional processors need more than 5 cycles/tap/sample to implement the above FIR filter

DSP architectures try to reduce the cycles needed to compute this primitive

This is accomplished by:

Harvard architecture

Efficient addressing modes



Slide: 13

Harvard Architecture

Basic Harvard Architecture

Separate program and data buses

Different from the von Neumann architecture

Modification 1

Data fetches are possible from program memory

The opcode fetch and one data fetch are done in parallel

[Figure: Basic Harvard architecture. Program memory on the P bus, data memory on the D bus.]

[Figure: Harvard architecture modification #1. Program/data memory on the P bus, data memory on the D bus.]



Slide: 14

Harvard Architecture

[Figure: Harvard architecture modification 2. Program memory on the P bus; multi-port data memory on D bus 1 and D bus 2.]

Modification 2

One program memory

One dual-ported data memory

Three buses for the internal memory: two for data and one for program

Two data words can be fetched in parallel with an instruction



Slide: 15

Harvard Architecture

[Figure: Harvard architecture modification 3. Program memory and program cache on the P bus; data memory 1 on D bus 1; data memory 2 on D bus 2.]

Modification 3

One program memory and a program cache

Two data memories

Three buses for the internal memory: two for data and one for program

Two data words can be fetched in parallel with an instruction



Slide: 16

Addressing Mode: Circular Addressing

Avoids shifting of data in the delay line

The oldest element is overwritten by the new element

The pointer wraps around once it crosses the start or the end of the circular buffer

Five parameters need to be maintained for circular buffer operation

[Figure: Circular buffer example with eight entries X(n), X(n-1), ..., X(n-7). X(n) is the most recent sample at time instant n and becomes the second most recent at instant n+1. X(n-7) is the oldest sample at instant n and will be overwritten by the new sample at instant n+1.]

[Figure: Buffer in memory between the Start and End addresses, holding X(n-m), X(n), X(n-m-1), ..., X(n-N).]
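Circular addressing can be imitated in software with modulo arithmetic on an index. The sketch below is an illustration only: the class `CircularDelayLine` and its methods are hypothetical, and it tracks just a buffer and one pointer rather than the five hardware parameters mentioned on the slide.

```python
class CircularDelayLine:
    """Delay line that overwrites the oldest sample instead of shifting."""

    def __init__(self, length):
        self.buf = [0] * length
        self.newest = 0          # index of the most recent sample

    def push(self, x):
        # Advance the pointer with wraparound; the slot it lands on
        # holds the oldest sample, which is overwritten.
        self.newest = (self.newest + 1) % len(self.buf)
        self.buf[self.newest] = x

    def tap(self, m):
        # Return x(n-m); the index wraps past the start of the buffer.
        return self.buf[(self.newest - m) % len(self.buf)]


def fir_circular(h, x):
    """FIR filter that never moves data inside the delay line."""
    line = CircularDelayLine(len(h))
    out = []
    for sample in x:
        line.push(sample)
        out.append(sum(h[m] * line.tap(m) for m in range(len(h))))
    return out


print(fir_circular([1, 2, 3], [1, 0, 0, 0]))   # [1, 2, 3, 0]
```

On a DSP the modulo step is done by the address generation hardware, so the wraparound costs no extra instructions.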



Slide: 17

Multiple Access Memories

Support multiple sequential accesses per instruction cycle

Can be combined with the Harvard architecture for better performance

Supporting off-chip memory means introducing significant additional delay between the processor core and memory



Slide: 18

Multiported Memories

Have multiple independent sets of address and data connections

Can provide multiple simultaneous accesses

Costly

Supporting off-chip memory means a larger and more expensive package



Slide: 19

Program Cache

The simplest type is a single-instruction repeat buffer

Can be extended to a multi-word repeat buffer

Another type is a single-sector instruction cache

Can be extended to a cache with multiple independent sectors

Used only for program instructions and not for data



Slide: 20

Wait States

A state in which the processor waits to access memory

Conflict wait states: multiple accesses to a memory that cannot handle multiple accesses

Externally requested wait states: multiple processors sharing a data bus

The TMS320C5x has a special READY pin which can be used by external hardware to signal the processor that it must wait before accessing external memory



Slide: 21

Multiprocessor Support- Memory Interface

Multiple external memory ports

Sometimes multiple processors share one external memory bus

Bus arbitration is required

Two pins can be configured to act as bus request and bus grant signals

The TMS320C5x allows external access to on-chip memory through the BR and IAQ signals

Helpful for multiprocessor communication without shared memory



Slide: 22

Direct Memory Access

Handled by a DMA controller

Coupled with the bus request and bus grant pins of the processor

Some sophisticated DMA controllers reside on-chip and access on-chip memory

Multiple-channel DMA controllers handle multiple memory transfers in parallel



Slide: 23

Memory Addressing Schemes

Implied addressing

Operand addresses are implied by the instruction

P = X * Y

Immediate data

Operand itself is encoded in the instruction

AX0 = 1234

Memory direct addressing

The address of the data in memory is encoded in the instruction word

AX0 = DM(1000)



Slide: 24


Register direct addressing

The data being addressed resides in a register

SUBF R1, R2

Register indirect addressing

The data resides in memory and the address resides in a register

A0 = A0 + *R5

[Figure: Address registers and memory. An address register holds 0x1000, which points to memory location 0x1000 containing the value 7.]



Slide: 25


Register indirect addressing with pre- and post-increment

A0 = A0 + *R5++ (Post Increment)

A0 = A0 + *R5++R17 (Post Increment)

The address is incremented by the value stored in register R17

MOVE X: -(R0), A1 (Pre Decrement)
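The register updates implied by these three instructions can be traced with a small simulation. The memory contents, the initial register values and the helper names are all made up for illustration; only the instruction semantics follow the slide.

```python
# Hypothetical memory image and register file for illustration only.
memory = {0x1000: 7, 0x1001: 9, 0x1002: 11, 0x1003: 13}
regs = {"R5": 0x1000, "R17": 2, "R0": 0x1004}


def load_post_increment(reg, step=1):
    """Read memory at the address in reg, then add step to reg."""
    address = regs[reg]
    regs[reg] = address + step
    return memory[address]


def load_pre_decrement(reg):
    """Subtract one from reg first, then read memory at the new address."""
    regs[reg] -= 1
    return memory[regs[reg]]


a0 = 0
a0 += load_post_increment("R5")                # A0 = A0 + *R5++
a0 += load_post_increment("R5", regs["R17"])   # A0 = A0 + *R5++R17
a1 = load_pre_decrement("R0")                  # MOVE X: -(R0), A1

print(a0, hex(regs["R5"]), a1, hex(regs["R0"]))   # 16 0x1003 13 0x1003
```

In each case the address update happens alongside the data access, so stepping through an array needs no separate pointer arithmetic instruction.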



Slide: 26


Register indirect addressing with indexing

The values stored in two address registers are added to form an effective address

Does not change the content of either address register

MOVE Y1, X: (R6 + N6)

LDI *-AR1(1), R7



Slide: 27


Register addressing with bit reversal

Used for the FFT

The output or input is in a scrambled order

000 = 0
100 = 4
010 = 2
110 = 6
001 = 1
101 = 5
011 = 3
111 = 7
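The scrambled order listed above is what results from reversing the bits of each index in turn. A minimal software sketch, with `bit_reverse` as a hypothetical helper name:

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the low bit across
        index >>= 1
    return result


# Addresses visited when stepping 0..7 through a 3-bit reversed index.
order = [bit_reverse(i, 3) for i in range(8)]
print(order)   # [0, 4, 2, 6, 1, 5, 3, 7]
```

A DSP with bit-reversed addressing produces this sequence in its address generator, so an 8-point FFT can read or write its data in the required order without a separate reordering pass.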



Slide: 28

Instruction Set



Slide: 29

Instruction Types

Arithmetic and multiplication

Logic operations

Shifting

Rotation

Comparison

Looping

Branching, subroutine calls and returns

Conditional instructions

Special function instructions: block floating-point instructions, stack operations, etc.

Bit manipulation instructions



Slide: 30

Registers

Accumulators

General and special purpose registers

Address registers

Other registers:

Stack pointer

Program counter

Loop registers



Slide: 31

Parallel Move Support

Operand-related parallel moves

MPY (R0), (R4)

Memory accesses are limited to those required by the arithmetic operation

Operand-unrelated parallel moves

MPY X0, Y0, A X: (R0)+, X0 Y1, Y: (R4)+

Memory accesses are unrelated to the operands of the ALU operation



Slide: 32

Orthogonality

Indicates the extent to which the processor's instruction set is consistent

Depends upon:

Consistency and completeness of the instruction set

The degree to which operands and addressing modes are uniformly available with different operations



Slide: 33

Assembly Language Format

Traditional opcode-operand variety

MPY X0, Y0
ADD P, A
MOV (R0), X0
JMP LOOP

C-like syntax

P = X0 * Y0
A = P + A
X0 = *R0
GOTO LOOP



Slide: 34

Execution Control



Slide: 35

Looping

Hardware looping

RPT #16
MAC (R0)+, (R4)+, A

Software looping

MOVE #16, B
LOOP: MAC (R0)+, (R4)+, A
DEC B
JNE LOOP



Slide: 36

Considerations in Looping

Sometimes a loop repetition count of 0 causes the processor to repeat the loop the maximum number of times

Consider loop effects on interrupt latency
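One way the zero-count behaviour can arise is when the loop counter is decremented before it is tested. The model below is a hypothetical illustration of that mechanism with an assumed 16-bit counter; it does not describe any particular processor.

```python
def hardware_loop_iterations(count, counter_bits=16):
    """Count loop-body executions for a decrement-then-test loop counter.

    Assumption: the counter is an unsigned register that wraps on underflow.
    """
    mask = (1 << counter_bits) - 1
    counter = count & mask
    iterations = 0
    while True:
        iterations += 1                  # the loop body executes
        counter = (counter - 1) & mask   # decrement with wraparound
        if counter == 0:                 # test after the decrement
            break
    return iterations


print(hardware_loop_iterations(16))   # 16
print(hardware_loop_iterations(0))    # 65536: the maximum, not zero
```

If a repetition count is computed at run time and may be zero, a guard branch around the loop avoids this case.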



Slide: 37

Nesting

Directly nestable: a hardware loop instruction is placed within the outer loop

Partially nestable: a single-instruction loop inside a multi-instruction loop

Software nestable: multi-instruction hardware loops are nested by saving various registers such as the loop index, loop start and loop count



Slide: 38

Interrupts

Interrupt sources: on-chip peripherals, external interrupt lines and software interrupts

Interrupt vectors: each interrupt is associated with a different memory address

Typically one or two words long and located in low memory

Usually contain a branch or subroutine call to an interrupt handler routine



Slide: 39

Interrupt latency

The time between the assertion of an external interrupt line and the execution of the first word of the interrupt vector

The following add up to the interrupt latency:

The interrupt line must be asserted prior to the start of the instruction cycle in which the interrupt is said to have occurred (setup time)

The signal must pass through synchronization stages

Wait until the processor reaches an interruptible state

Wait until all instructions in the pipeline are finished

If the interrupt vector holds only the address of the interrupt routine, the time required to branch to that location



Slide: 40

Stacks

Typically one of three kinds of stack support is provided:

Shadow registers

Hardware stack

Software stack



Slide: 41

Pipelining


Slide: 42

Pipelining and Performance

A technique for increasing the performance of a processor

Breaks a sequence of operations into smaller pieces

Executes the pieces in parallel whenever possible

Hypothetical processor:

Fetch an instruction word from memory

Decode the instruction

Read/write data operands from/to memory

Execute the ALU or MAC operation of the instruction


Slide: 43

Pipelining and Performance

Clock cycle        1    2    3    4    5    6    7
Instruction fetch  I1   I2   I3   I4   I5   I6   I7
Decode                  I1   I2   I3   I4   I5   I6
Data read/write              I1   I2   I3   I4   I5
Execute                           I1   I2   I3   I4

(The four rows are the pipeline stages, i.e. the pipeline depth.)

Perfect overlap

100% utilization of the processor execution stages

Ideal scenario
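The ideal staircase pattern follows from one rule: instruction k occupies stage s (counting from 0) in clock cycle k + s. The sketch below generates the table from that rule; the function names and the `stall_cycles` parameter are hypothetical.

```python
STAGES = ["Instruction fetch", "Decode", "Data read/write", "Execute"]


def pipeline_table(num_instructions, num_cycles, stages=STAGES):
    """Return {stage name: [instruction label per clock cycle]}."""
    table = {}
    for s, name in enumerate(stages):
        row = []
        for cycle in range(1, num_cycles + 1):
            k = cycle - s   # instruction k is in stage s during cycle k + s
            row.append(f"I{k}" if 1 <= k <= num_instructions else "")
        table[name] = row
    return table


def total_cycles(num_instructions, depth, stall_cycles=0):
    """Cycles to finish all instructions: fill the pipe, then one per cycle."""
    return depth + num_instructions - 1 + stall_cycles


for name, row in pipeline_table(7, 7).items():
    print(f"{name:<19}" + "".join(f"{cell:<5}" for cell in row))

print(total_cycles(7, len(STAGES)))   # 10
```

Without pipelining the same seven instructions would need 7 x 4 = 28 cycles, so perfect overlap approaches one instruction per cycle once the pipeline is full.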


Slide: 44

Conflicting Instruction

Clock cycle        1      2      3      4      5      6      7
Instruction fetch  I1     I2     I3     I4     I5     I6     I7
Decode                    I1     I2     I3     I4     I5     I6
Data read/write                  I1     I2     I2/I3  I4     I5
Execute                                 I1     I2     I3     I4

(The four rows are the pipeline stages, i.e. the pipeline depth.)

I2 tries to write to memory while I3 tries to read memory

The solution to this problem is interlocking

Interlocking delays the conflicting instruction in the pipeline


Slide: 45

Interlocking

Clock cycle        1    2    3    4    5    6    7
Instruction fetch  I1   I2   I3   I4   I4   I5   I6
Decode                  I1   I2   I3   I3   I4   I5
Data read/write              I1   I2   I2   I3   I4
Execute                           I1   I2   NOP  I3

(The four rows are the pipeline stages, i.e. the pipeline depth.)

Interlocking resolves the resource conflict

The pipeline sequencer holds instruction I3 at the decode stage

I4 is held at the fetch stage

A penalty of one instruction cycle occurs


Slide: 46

Multicycle Branching Effects

Clock cycle        1    2    3    4    5    6    7    8
Instruction fetch  BR   I2   ---  ---  I4   I5   I6   I7
Decode                  BR   ---  ---  ---  I4   I5   I6
Data read/write              BR   ---  ---  ---  I4   I5
Execute                           BR   NOP  NOP  NOP  I4

When a branch instruction reaches the decode stage, one instruction has already been fetched and has to be flushed from the pipeline

NOPs are executed for the invalidated pipeline slots

A multicycle branch typically executes for as many cycles as the pipeline depth


Slide: 47

Delayed Branching Effects

Clock cycle        1    2    3    4    5    6    7    8
Instruction fetch  BR   N2   N3   N4   I4   I5   I6   I7
Decode                  BR   N2   N3   N4   I4   I5   I6
Data read/write              BR   N2   N3   N4   I4   I5
Execute                           BR   N2   N3   N4   I4

An alternative to the multicycle branch that does not flush the pipeline

Instructions that are to be executed before the branch takes effect must be located immediately after the branch instruction in memory

Increases efficiency, but the code is confusing on casual inspection


Slide: 48

Interrupt Effects

Clock cycle        3     4     5     6     7     8     9     10
Instruction fetch  I6    ---   ---   ---   V1    V2    V3    V4
Decode             I5    INTR  ---   ---   ---   V1    V2    V3
Data read/write    I4    I5    INTR  ---   ---   ---   V1    V2
Execute            I3    I4    I5    INTR  NOP   NOP   NOP   V1

[Figure annotation: INTERRUPT marks the point at which the interrupt arrives.]

The processor inserts the INTR instruction into the pipeline

INTR is a special branch instruction that flushes the pipeline and jumps to the appropriate interrupt vector location

Causes a 4-cycle delay before the first word of the interrupt vector is executed

I6 is flushed but will be refetched on returning from the interrupt


Slide: 49

Fast Interrupt Processing

Clock cycle        1    2    3    4    5    6    7    8
Instruction fetch  I3   I4   V1   V2   I5   I6   I7   I8
Decode             I2   I3   I4   V1   V2   I5   I6   I7
Execute            I1   I2   I3   I4   V1   V2   I5   I6

[Figure annotation: INTERRUPT marks the point at which the interrupt arrives.]

The interrupt handler is stored at the interrupt vector location

In this case V1 and V2 are the two instructions in the interrupt vector

This is called a fast interrupt because it does not insert any delay in the pipeline



Slide: 50

Peripherals


Slide: 51

Serial Ports

A serial interface transmits and receives data one bit at a time

Requires far fewer interface pins than a parallel interface

Used for a variety of applications:

Sending/receiving data to/from A/D and D/A converters

Sending/receiving data to/from other processors or DSPs

Communicating with other external peripherals


Slide: 52

Serial Ports

Synchronous

Transmits a bit clock signal in addition to the serial data bits

The receiver uses that clock for sampling the received data

Asynchronous

Does not transmit a separate clock signal

The receiver deduces the clock signal from the serial data itself

More complex


Slide: 53

Data and Clock

[Figure: Timing diagram of the bit clock, frame sync and data signals, with the bits numbered 0, 1, ...]

Most DSPs allow changing the clock polarity, data polarity and shift direction

The frame sync signal indicates the position of the first bit of a data word on the serial data line

Common frame sync formats are bit length and word length

Also can have multiple words per frame


Slide: 54

Serial Clock Generation

DSPs provide circuitry for clock generation, usually called serial clock generation support

Normally done by scaling the master clock in the DSP

Usually contains a prescaler and a down counter


Slide: 55

Time Division Multiplex

[Figure: Four DSPs connected to shared CLOCK, FRAME SYNC and DATA lines.]

One processor (or external circuitry) generates the clock and frame sync signals

Frame sync indicates the start of a new set of time slots

A transmitted data word might contain some number of bits to indicate the destination DSP; the other bits are used for data
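The idea of reserving some bits of each word for the destination can be sketched as follows. The 16-bit word split into 2 address bits and 14 data bits is an arbitrary assumption for illustration, as are the function names; real devices define their own formats.

```python
# Hypothetical word layout: [ 2 destination bits | 14 data bits ]
ADDRESS_BITS = 2
DATA_BITS = 14


def pack_word(destination, data):
    """Build one TDM word addressed to a given DSP."""
    if not 0 <= destination < (1 << ADDRESS_BITS):
        raise ValueError("destination does not fit in the address field")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data does not fit in the data field")
    return (destination << DATA_BITS) | data


def unpack_word(word):
    """Split a word into (destination, data)."""
    return word >> DATA_BITS, word & ((1 << DATA_BITS) - 1)


def receive(my_id, frame):
    """Keep only the data words in a frame that are addressed to my_id."""
    return [data for dest, data in map(unpack_word, frame) if dest == my_id]


# One frame of four time slots on the shared data line.
frame = [pack_word(0, 100), pack_word(2, 200),
         pack_word(2, 300), pack_word(3, 400)]
print(receive(2, frame))   # [200, 300]
```

Every DSP on the bus sees every word; the address field lets each one discard the words meant for the others, at the cost of fewer bits for data.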


Slide: 56

Timers

Programmable timers are often a source of periodic interrupts

May also be used as a software-controlled square wave generator

[Figure: Timer block diagram. The clock source drives a prescaler, which drives a counter; each is reloaded from its own preload value.]
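A prescaler followed by a counter, each reloaded from a preload value, can be modelled as two nested down counters. This is a hypothetical model: whether a real timer's period is the preload value or the preload value plus one depends on the device, and the function name is made up.

```python
def timer_interrupt_ticks(prescale_preload, counter_preload, clock_ticks):
    """Return the clock tick numbers at which the timer raises an interrupt.

    Assumption: each stage counts down to zero and reloads on the next
    tick, so the period is (prescale_preload + 1) * (counter_preload + 1).
    """
    prescaler = prescale_preload
    counter = counter_preload
    interrupts = []
    for tick in range(1, clock_ticks + 1):
        if prescaler > 0:
            prescaler -= 1
            continue
        prescaler = prescale_preload   # prescaler expired: reload it
        if counter > 0:
            counter -= 1               # and clock the main counter once
            continue
        counter = counter_preload      # counter expired: reload and interrupt
        interrupts.append(tick)
    return interrupts


print(timer_interrupt_ticks(1, 2, 12))   # [6, 12]
```

Toggling an output pin in each interrupt would give the software-controlled square wave mentioned above, with a period of twice the interrupt interval.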


Slide: 57

Parallel Ports

Transmit/receive multiple data bits at a time

Faster than serial ports but require more pins

The external data bus may be used as a parallel port

Can also have separate parallel ports

Bit I/O ports: individual pins can be made input or output on a bit-by-bit basis

Host ports: specialized 8/16-bit bidirectional parallel ports used for data transfer between the DSP and a host microprocessor

May be used to control the DSP

Communication ports: special parallel ports intended for multiprocessor communication

