Download pdf - DSP478 week1 [Uyumluluk Modu] - eskisehir.edu.treem.eskisehir.edu.tr/mfidan/EEM 478/icerik/eem478_dsphw_week1.pdf1 $ elw 0,36 [ elw zlwk elw uhvxow . . . elw elw elw 0)/236 [ elw zlwk

EEM478-WEEK1

Introduction

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002Chapter 1, Slide 2

Learning Objectives

Why process signals digitally?

Definition of a real-time application.

Why use Digital Signal Processing processors?

What are the typical DSP algorithms?

Parameters to consider when choosing a DSP processor.

Programmable vs ASIC DSP.

Texas Instruments’ TMS320 family.


Why go digital?

Digital signal processing techniques are now so powerful that sometimes it is extremely difficult, if not impossible, for analogue signal processing to achieve similar performance.

Examples: FIR filter with linear phase.

Adaptive filters.


Why go digital?

Analogue signal processing is achieved by using analogue components such as: Resistors.

Capacitors.

Inductors.

The inherent tolerances associated with these components, temperature, voltage changes and mechanical vibrations can dramatically affect the effectiveness of the analogue circuitry.


Why go digital?

With DSP it is easy to: Change applications.

Correct applications.

Update applications.

Additionally DSP reduces: Noise susceptibility.

Chip count.

Development time.

Cost.

Power consumption.


Why NOT go digital?

High frequency signals cannot be processed digitally because of two reasons: Analog to Digital Converters, ADC cannot

work fast enough.

The application can be too complex to be performed in real-time.


DSP processors have to perform tasks in real-time, so how do we define real-time?

The definition of real-time depends on the application.

Example: a 100-tap FIR filter is performed in real-time if the DSP can perform and complete the following operation between two samples:

Real-time processing


We can say that we have a real-time application if: Waiting Time 0

Real-time processing

Processing TimeWaiting Time

Sample Timen n+1


Why not use a General Purpose Processor (GPP) such as a Pentium instead of a DSP processor? What is the power consumption of a

Pentium and a DSP processor?

What is the cost of a Pentium and a DSP processor?

Why do we need DSP processors?


Use a DSP processor when the following are required: Cost saving.

Smaller size.

Low power consumption.

Processing of many “high” frequency signals in real-time.

Use a GPP processor when the following are required: Large memory.

Advanced operating systems.

Why do we need DSP processors?


What are the typical DSP algorithms?

The Sum of Products (SOP) is the key element in most DSP algorithms:


Hardware vs. Microcode multiplication

DSP processors are optimised to perform multiplication and addition operations.

Multiplication and addition are done in hardware and in one cycle.

Example: 4-bit multiply (unsigned).

1011x 1110

1011x 1110

Hardware Microcode

10011010 00001011.1011..

1011...

10011010

Cycle 1Cycle 2Cycle 3Cycle 4

Cycle 5


Parameters to consider when choosing a DSP processor

Parameter

Arithmetic format

Extended floating point

Extended Arithmetic

Performance (peak)

Number of hardware multipliers

Number of registers

Internal L1 program memory cache

Internal L1 data memory cache

Internal L2 cache

32-bit

N/A

40-bit

1200MIPS

2 (16 x 16-bit) with 32-bit result

32

32K

32K

512K

32-bit

64-bit

40-bit

1200MFLOPS

2 (32 x 32-bit) with 32 or 64-bit result

32

32K

32K

512K

TMS320C6211 (@150MHz)

TMS320C6711 (@150MHz)

C6711 Datasheet: \Links\TMS320C6711.pdf

C6211 Datasheet: \Links\TMS320C6211.pdf


Parameters to consider when choosing a DSP processor

Parameter

I/O bandwidth: Serial Ports (number/speed)

DMA channels

Multiprocessor support

Supply voltage

Power management

On-chip timers (number/width)

Cost

Package

External memory interface controller

JTAG

2 x 75Mbps

16

Not inherent

3.3V I/O, 1.8V Core

Yes

2 x 32-bit

US$ 21.54

256 Pin BGA

Yes

Yes

2 x 75Mbps

16

Not inherent

3.3V I/O, 1.8V Core

Yes

2 x 32-bit

US$ 21.54

256 Pin BGA

Yes

Yes

TMS320C6211 (@150MHz)

TMS320C6711 (@150MHz)


Floating vs. Fixed point processors

Applications which require: High precision.

Wide dynamic range.

High signal-to-noise ratio.

Ease of use.

Need a floating point processor.

Drawback of floating point processors: Higher power consumption.

Can be higher cost.

Can be slower than fixed-point counterparts and larger in size.


Floating vs. Fixed point processors

It is the application that dictates which device and platform to use in order to achieve optimum performance at a low cost.

For educational purposes, use the floating-point device (C6711) as it can support both fixed and floating point operations.


General Purpose DSP vs. DSP in ASIC

Application Specific Integrated Circuits (ASICs) are semiconductors designed for dedicated functions.

The advantages and disadvantages of using ASICs are listed below:

Advantages

• High throughput• Lower silicon area• Lower power consumption• Improved reliability• Reduction in system noise• Low overall system cost

Disadvantages

• High investment cost• Less flexibility• Long time from design to

market


Texas Instruments’ TMS320 family

Different families and sub-families exist to support different markets.

Lowest CostControl Systems Motor Control Storage Digital Ctrl Systems

C2000 C5000

EfficiencyBest MIPS per

Watt / Dollar / Size Wireless phones Internet audio players Digital still cameras Modems Telephony VoIP

C6000

Multi Channel and Multi Function App's

Comm Infrastructure Wireless Base-stations DSL Imaging Multi-media Servers Video

Performance &Best Ease-of-Use


C6713C62x™

C6000 RoadmapP

erf

orm

an

ce

Time

Software CompatibleFloating PointFloating Point

Multi-coreMulti-core C64x™ DSP1.1 GHz

C64x™ DSP1.1 GHz

C64x™ DSPC64x™ DSP

2nd Generation (Fixed Point)

General Purpose C6414C6414 C6415C6415 C6416C6416

MediaGateway

3G Wireless Infrastructure

C6201

C6701

C6202

C6203

C6211C6711

C6204

1st Generation

C6205

C6712C67x™

Fixed-point

Floating-point

C6411C6411


Part 2

TMS320C6000 Architectural Overview


Describe C6000 CPU architecture.

Introduce some basic instructions.

Describe the C6000 memory map.

Provide an overview of the peripherals.

Learning Objectives


General DSP System Block Diagram

PERIPHERALS

Central

Processing

Unit

Internal Memory

Internal Buses

ExternalMemory


Implementation of Sum of Products (SOP)

It has been shown in Chapter 1 that SOP is the key element for most DSP algorithms.

So let’s write the code for this algorithm and at the same time discover the C6000 architecture.

Two basic

operations are required

for this algorithm.

(1) Multiplication

(2) Addition

Therefore two basic

instructions are required

Y =N

an xnn = 1

*

= a1 * x1 + a2 * x2 +... + aN * xN


Two basic

operations are required

for this algorithm.

(1) Multiplication

(2) Addition

Therefore two basic

instructions are required

Implementation of Sum of Products (SOP)

Y =N

an xnn = 1

*So let’s implement the SOP algorithm!

The implementation in this module will be done in assembly.

= a1 * x1 + a2 * x2 +... + aN * xN


Multiply (MPY)

The multiplication of a1 by x1 is done in assembly by the following instruction:

MPY a1, x1, Y

This instruction is performed by a multiplier unit that is called “.M”

Y =N

an xnn = 1

*

= a1 * x1 + a2 * x2 +... + aN * xN


Multiply (.M unit)

.M

Y =40

an xnn = 1

*

The . M unit performs multiplications in hardware

MPY .M a1, x1, Y

Note: 16-bit by 16-bit multiplier provides a 32-bit result.

32-bit by 32-bit multiplier provides a 64-bit result.


Addition (.?)

.M

.?

Y =40

an xnn = 1

*

MPY .M a1, x1, prod

ADD .? Y, prod, Y


Add (.L unit)

.M

.L

Y =40

an xnn = 1

*

MPY .M a1, x1, prod

ADD .L Y, prod, Y

RISC processors such as the C6000 use registers to hold the operands, so lets change this code.


Register File - A

Y =40

an xnn = 1

*

MPY .M a1, x1, prod

ADD .L Y, prod, Y

.M

.L

A0A1A2A3A4

A15

Register File A

.

.

.

a1x1

prod

32-bits

Y

Let us correct this by replacing a, x, prod and Y by the registers as shown above.


Specifying Register Names

Y =40

an xnn = 1

*

MPY .M A0, A1, A3

ADD .L A4, A3, A4

The registers A0, A1, A3 and A4 contain the values to be used by the instructions.

.M

.L

A0A1A2A3A4

A15

Register File A

.

.

.

a1x1

prod

32-bits

Y


Specifying Register Names

Y =40

an xnn = 1

*

MPY .M A0, A1, A3

ADD .L A4, A3, A4

Register File A contains 16 registers (A0 -A15) which are 32-bits wide.

.M

.L

A0A1A2A3A4

A15

Register File A

.

.

.

a1x1

prod

32-bits

Y


Data loading

Q: How do we load the operands into the registers?

.M

.L

A0A1A2A3A4

A15

Register File A

.

.

.

a1x1

prod

32-bits

Y


Load Unit “.D”

A: The operands are loaded into the registers by loading them from the memory using the .D unit.

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

a1x1

prod

32-bits

Y

.D

Data Memory

Q: How do we load the operands into the registers?


Load Unit “.D”

It is worth noting at this stage that the only way to access memory is through the .D unit.

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

a1x1

prod

32-bits

Y

.D

Data Memory


Load Instruction

Q: Which instruction(s) can be used for loading operands from the memory to the registers?

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

a1x1

prod

32-bits

Y

.D

Data Memory


Load Instructions (LDB, LDH,LDW,LDDW)

A: The load instructions..M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

a1x1

prod

32-bits

Y

.D

Data Memory

Q: Which instruction(s) can be used for loading operands from the memory to the registers?


Using the Load Instructions

00000000

00000002

00000004

00000006

00000008

Data

16-bits

Before using the load unit you have to be aware that this processor is byte addressable, which means that each byte is represented by a unique address.

Also the addresses are 32-bit wide.

address

FFFFFFFF


The syntax for the load instruction is:

Where:

Rn is a register that contains the address of the operand to be loaded

and

Rm is the destination register.


00000000

00000002

00000004

00000006

00000008

Data

a1x1

prod

16-bits

Y

address

FFFFFFFF

LD *Rn,Rm



The question now is how many bytes are going to be loaded into the destination register?


00000000

00000002

00000004

00000006

00000008

Data

a1x1

prod

16-bits

Y

address

FFFFFFFF

LD *Rn,Rm



LD *Rn,Rm


00000000

00000002

00000004

00000006

00000008

Data

a1x1

prod

16-bits

Y

address

FFFFFFFF

The answer, is that it depends on the instruction you choose:• LDB: loads one byte (8-bit)

• LDH: loads half word (16-bit)

• LDW: loads a word (32-bit)

• LDDW: loads a double word (64-bit)

Note: LD on its own does not exist.



00000000

00000002

00000004

00000006

00000008

Data

16-bits

address

FFFFFFFF

0xB0xA0xD0xC

Example:

If we assume that A5 = 0x4 then:

(1) LDB *A5, A7 ; gives A7 = 0x00000001

(2) LDH *A5,A7; gives A7 = 0x00000201

(3) LDW *A5,A7; gives A7 = 0x04030201

(4) LDDW *A5,A7:A6; gives A7:A6 = 0x0807060504030201

0x10x20x30x40x50x60x70x8


LD *Rn,Rm

01



00000000

00000002

00000004

00000006

00000008

Data

16-bits

address

FFFFFFFF

0xB0xA0xD0xC

Question:

If data can only be accessed by the load instruction and the .D unit, how can we load the register pointer Rn in the first place?

0x10x20x30x40x50x60x70x8


LD *Rn,Rm


The instruction MVKL will allow a move of a 16-bit constant into a register as shown below:

MVKL .? a, A5(‘a’ is a constant or label)

How many bits represent a full address?

32 bits

So why does the instruction not allow a 32-bit move?

All instructions are 32-bit wide (see instruction opcode).

Loading the Pointer Rn


To solve this problem another instruction is available:

MVKH


eg. MVKH .? a, A5(‘a’ is a constant or label)

ah

ah x

al a

A5

MVKL a, A5

MVKH a, A5

Finally, to move the 32-bit address to a register we can use:



MVKL0x1234FABC, A5

A5 = 0xFFFFFABC ; Wrong

Example 1

A5 = 0x87654321

MVKL0x1234FABC, A5

A5 = 0xFFFFFABC (sign extension)

MVKH0x1234FABC, A5

A5 = 0x1234FABC ; OK

Example 2

MVKH0x1234FABC, A5

A5 = 0x12344321

Always use MVKL then MVKH, look at the following examples:


LDH, MVKL and MVKH

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

Data Memory

MVKL pt1, A5MVKH pt1, A5


LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4


Creating a loop



LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

So far we have only implemented the SOP for one tap only, i.e.

Y= a1 * x1

So let’s create a loop so that we can implement the SOP for N Taps.


Creating a loop

With the C6000 processors there are no dedicated

instructions such as block repeat. The loop is created

using the B instruction.

So far we have only implemented the SOP for one tap only, i.e.

Y= a1 * x1

So let’s create a loop so that we can implement the SOP for N Taps.


What are the steps for creating a loop

1. Create a label to branch to.

2. Add a branch instruction, B.

3. Create a loop counter.

4. Add an instruction to decrement the loop counter.

5. Make the branch conditional based on the value in

the loop counter.


1. Create a label to branch to



loop LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4




loop LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

B .? loop

2. Add a branch instruction, B.


Which unit is used by the B instruction?

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

Data Memory

.S



loop LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

B .? loop


Data Memory

Which unit is used by the B instruction?

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

.S

MVKL .S pt1, A5MVKH .S pt1, A5


loop LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

B .S loop


Data Memory

3. Create a loop counter.

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

.S


MVKL .S pt2, A6MVKH .S pt2, A6MVKL .S count, B0

loop LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

B .S loop

B registers will be introduced later


4. Decrement the loop counter

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

Data Memory

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

.S


MVKL .S pt2, A6MVKH .S pt2, A6MVKL .S count, B0

loop LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

SUB .S B0, 1, B0

B .S loop


What is the syntax for making instruction conditional?

[condition] Instruction Label

e.g.

[B1] B loop

(1) The condition can be one of the following registers: A1, A2, B0, B1, B2.

(2) Any instruction can be conditional.

5. Make the branch conditional based on the value in the loop counter


The condition can be inverted by adding the exclamation symbol “!” as follows:

[!condition] Instruction Label

e.g.

[!B0] B loop ;branch if B0 = 0

[B0] B loop ;branch if B0 != 0

5. Make the branch conditional based on the value in the loop counter


Data Memory

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

.M

.L

A0

A1

A2

A3

A15

Register File A

.

.

.

ax

prod

32-bits

Y

.D

.S

MVKL .S2 pt1, A5MVKH .S2 pt1, A5

MVKL .S2 pt2, A6MVKH .S2 pt2, A6MVKL .S2 count, B0

loop LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

SUB .S B0, 1, B0

[B0] B .S loop

5. Make the branch conditional


Case 1: B .S1 label Relative branch.

Label limited to +/- 220 offset.

More on the Branch Instruction (1)

With this processor all the instructions are encoded in a 32-bit.

Therefore the label must have a dynamic range of less than 32-bit as the instruction B has to be coded.

21-bit relative addressB

32-bit


More on the Branch Instruction (2)

By specifying a register as an operand instead of a label, it is possible to have an absolute branch.

This will allow a dynamic range of 232.

Case 2: B .S2 register Absolute branch.

Operates on .S2 ONLY!

5-bit register

codeB

32-bit


Testing the code

This code performs the following

operations:

a0*x0 + a0*x0 + a0*x0 + … + a0*x0

However, we would like to perform:

a0*x0 + a1*x1 + a2*x2 + … + aN*xN



loop LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

SUB .S B0, 1, B0

[B0] B .S loop


Modifying the pointers

The solution is to modify the pointers

A5 and A6.



loop LDH .D *A5, A0

LDH .D *A6, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

SUB .S B0, 1, B0

[B0] B .S loop


Indexing Pointers

Description

Pointer

Syntax PointerModified

*R No

R can be any register

In this case the pointers are used but not modified.


Indexing Pointers

Description

Pointer

+ Pre-offset

- Pre-offset


*R

*+R[disp]

*-R[disp]

No

No

No

[disp] specifies the number of elements size in DW (64-bit), W (32-bit), H (16-bit), or B (8-bit).

disp = R or 5-bit constant. R can be any register.

In this case the pointers are modified BEFORE being used

and RESTORED to their previous values.


Indexing Pointers

Description

Pointer

+ Pre-offset

- Pre-offset

Pre-increment

Pre-decrement


*R

*+R[disp]

*-R[disp]

*++R[disp]

*--R[disp]

No

No

No

Yes

Yes

In this case the pointers are modified BEFORE being used

and NOT RESTORED to their Previous Values.


Indexing Pointers

Description

Pointer

+ Pre-offset

- Pre-offset

Pre-increment

Pre-decrement

Post-increment

Post-decrement


*R

*+R[disp]

*-R[disp]

*++R[disp]

*--R[disp]

*R++[disp]

*R--[disp]

No

No

No

Yes

Yes

Yes

Yes

In this case the pointers are modified AFTER being used

and NOT RESTORED to their Previous Values.


Indexing Pointers

Description

Pointer

+ Pre-offset

- Pre-offset

Pre-increment

Pre-decrement

Post-increment

Post-decrement


*R

*+R[disp]

*-R[disp]

*++R[disp]

*--R[disp]

*R++[disp]

*R--[disp]

No

No

No

Yes

Yes

Yes

Yes

[disp] specifies # elements - size in DW, W, H, or B. disp = R or 5-bit constant. R can be any register.


Modify and testing the code

This code now performs the following

operations:

a0*x0 + a1*x1 + a2*x2 + ... + aN*xN



loop LDH .D *A5++, A0

LDH .D *A6++, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

SUB .S B0, 1, B0

[B0] B .S loop


Store the final result

This code now performs the following

operations:

a0*x0 + a1*x1 + a2*x2 + ... + aN*xN




LDH .D *A6++, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

SUB .S B0, 1, B0

[B0] B .S loop

STH .D A4, *A7



The Pointer A7 has not been initialised.




LDH .D *A6++, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

SUB .S B0, 1, B0

[B0] B .S loop

STH .D A4, *A7



The Pointer A7 is now initialised.





LDH .D *A6++, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

SUB .S B0, 1, B0

[B0] B .S loop

STH .D A4, *A7


What is the initial value of A4?

A4 is used as an accumulator,

so it needs to be reset to zero.



MVKL .S2 pt3, A7MVKH .S2 pt3, A7MVKL .S2 count, B0ZERO .L A4


LDH .D *A6++, A1

MPY .M A0, A1, A3

ADD .L A4, A3, A4

SUB .S B0, 1, B0

[B0] B .S loop

STH .D A4, *A7


How can we add more processing

power to this processor?

.S1

.M1

.L1

.D1

A0A1A2A3A4

Register File A

.

.

.

Data Memory

A15

32-bits

Increasing the processing power!


(1) Increase the clock frequency.

.S1

.M1

.L1

.D1

A0A1A2A3A4

Register File A

.

.

.

Data Memory

A15

32-bits

Increasing the processing power!

(2) Increase the number of Processing units.


To increase the Processing Power, this processor has two sides (A and B or 1 and 2)

Data Memory

.S1

.M1

.L1

.D1

A0A1A2A3A4

Register File A

.

.

.

A15

32-bits

.S2

.M2

.L2

.D2

B0B1B2B3B4

Register File B

.

.

.

B15

32-bits


Can the two sides exchange operands in order to increase performance?

Data Memory

.S1

.M1

.L1

.D1

A0A1A2A3A4

Register File A

.

.

.

A15

32-bits

B15

.S2

.M2

.L2

.D2

B0B1B2B3B4

Register File B

.

.

.

32-bits


The answer is YES but there are limitations.

To exchange operands between the two sides, some cross paths or links are required.

What is a cross path? A cross path links one side of the CPU to

the other.

There are two types of cross paths:

Data cross paths.

Address cross paths.


Data Cross Paths

Data cross paths can also be referred to as register file cross paths.

These cross paths allow operands from one side to be used by the other side.

There are only two cross paths:

one path which conveys data from side B to side A, 1X.

one path which conveys data from side A to side B, 2X.


TMS320C67x Data-Path


Data Cross Paths

Data cross paths only apply to the .L, .S and .M units.

The data cross paths are very useful, however there are some limitations in their use.


Data Cross Path Limitations

A

2x

.L1

.M1

.S1

B

1x

<src>

<src><dst>

(1) The destination register must be on same side as unit.

(2) Source registers - up to one cross path per execute packet per side.

Execute packet: group of instructions that execute simultaneously.



A

2x

.L1

.M1

.S1

B

1x

<src>

<src><dst>

eg:ADD .L1x A0,A1,B2MPY .M1x A0,B6,A9SUB .S1x A8,B2,A8

|| ADD .L1x A0,B0,A2

|| Means that the SUB and ADD belong to the same fetch packet, therefore execute simultaneously.



A

2x

.L1

.M1

.S1

B

1x

<src>

<src><dst>

eg:ADD .L1x A0,A1,B2MPY .M1x A0,B6,A9SUB .S1x A8,B2,A8

|| ADD .L1x A0,B0,A2

NOT VALID!


Data Cross Paths for both sides

A

2x

.L1

.M1

.S1

B

1x

<src>

<src><dst>

.L2

.M2

.S2

<dst>

<src>

<src>


Address cross paths

.D1A

Addr

Data

LDW.D1T1 *A0,A5STW.D1T1 A5,*A0

(1) The pointer must be on the same side of the unit.


Load or store to either side

.D1A

*A0

B

Data1 A5

Data2 B5

DA1 = T1

DA2 = T2LDW.D1T1 *A0,A5LDW.D1T2 *A0,B5


Standard Parallel Loads

.D1A

A5

*A0

BB5

.D2

Data1

*B0

LDW.D1T1 *A0,A5|| LDW.D2T2 *B0,B5

DA1 = T1

DA2 = T2


Parallel Load/Store using address cross paths

.D1A

A5

*A0

BB5

.D2

Data1

*B0

LDW.D1T2 *A0,B5|| STW.D2T1 A5,*B0

DA1 = T1

DA2 = T2


Fill the blanks ... Does this work?

.D1A

*A0

B.D2

Data1

*B0

LDW.D1__ *A0,B5|| STW.D2__ B6,*B0

DA1 = T1

DA2 = T2


Not Allowed!

.D1A

*A0

BB5

B6

.D2

Data1

*B0

LDW.D1T2 *A0,B5|| STW.D2T2 B6,*B0

DA2 = T2


Not Allowed!Parallel accesses: both cross or neither cross

.D1A

*A0

BB5

B6

.D2

Data1

*B0

LDW.D1T2 *A0,B5|| STW.D2T2 B6,*B0

DA2 = T2


Conditions Don’t Use Cross Paths

If a conditional register comes from the opposite side, it does NOT use a data or address cross-path.

Examples:

[B2] ADD .L1 A2,A0,A4[A1] LDW .D2 *B0,B5


CPURef Guide

Full CPU Datapath(Pg 2-2)

‘C62x Data-Path Summary


‘C67x Data-Path Summary ‘C67x


Cross Paths - Summary

Data Destination register on same side as unit.

Source registers - up to one cross path per execute packet per side.

Use “x” to indicate cross-path.

Address Pointer must be on same side as unit. Data can be transferred to/from either side. Parallel accesses: both cross or neither cross.

Conditionals Don’t Use Cross Paths.


Code Review (using side A only)

MVK .S1 40, A2 ; A2 = 40, loop count

loop: LDH .D1 *A5++, A0 ; A0 = a(n)

LDH .D1 *A6++, A1 ; A1 = x(n)

MPY .M1 A0, A1, A3 ; A3 = a(n) * x(n)

ADD .L1 A3, A4, A4 ; Y = Y + A3

SUB .L1 A2, 1, A2 ; decrement loop count

[A2] B .S1 loop ; if A2 0, branch

STH .D1 A4, *A7 ; *A7 = Y

Y =40

an xnn = 1

*

Note: Assume that A4 was previously cleared and the pointers are initialised.


Let us have a look at the final details concerning the functional units.

Consider first the case of the .L and .S units.


So where do the 40-bit registers come from?

Operands - 32/40-bit Register, 5-bit Constant

Operands can be: 5-bit constants (or 16-bit for MVKL and MVKH).

32-bit registers.

40-bit Registers.

However, we have seen that registers are only 32-bit.


A 40-bit register can be obtained by concatenating two registers.

However, there are 3 conditions that need to be respected:

The registers must be from the same side.

The first register must be even and the second odd.

The registers must be consecutive.



A1:A0

A3:A2

A5:A4

A7:A6

A9:A8

A11:A10

A13:A12

A15:A14

odd even:328

40-bit Reg

B1:B0

B3:B2

B5:B4

B7:B6

B9:B8

B11:B10

B13:B12

B15:B14

odd even:328

40-bit Reg


All combinations of 40-bit registers are shown below:


32-bitReg

40-bitReg

< src > < src >

32-bitReg

5-bitConst

32-bitReg

40-bitReg

< dst >

.L or .S


instr .unit <src>, <src>, <dst>




32-bitReg

40-bitReg

< src > < src >

32-bitReg

5-bitConst

32-bitReg

40-bitReg

< dst >

.L or .S



OR.L1 A0, A1, A2


32-bitReg

40-bitReg

< src > < src >

32-bitReg

5-bitConst

32-bitReg

40-bitReg

< dst >

.L or .S



OR.L1 A0, A1, A2

ADD.L2 -5, B3, B4


32-bitReg

40-bitReg

< src > < src >

32-bitReg

5-bitConst

32-bitReg

40-bitReg

< dst >

.L or .S



OR.L1 A0, A1, A2

ADD.L2 -5, B3, B4ADD.L1 A2, A3, A5:A4


32-bitReg

40-bitReg

< src > < src >

32-bitReg

5-bitConst

32-bitReg

40-bitReg

< dst >

.L or .S



OR.L1 A0, A1, A2

ADD.L2 -5, B3, B4ADD.L1 A2, A3, A5:A4

SUB.L1 A2, A5:A4, A5:A4


32-bitReg

40-bitReg

< src > < src >

32-bitReg

5-bitConst

32-bitReg

40-bitReg

< dst >

.L or .S



OR.L1 A0, A1, A2

ADD.L2 -5, B3, B4ADD.L1 A2, A3, A5:A4

SUB.L1 A2, A5:A4, A5:A4ADD.L2 3, B9:B8, B9:B8


32-bitReg

40-bitReg

< src > < src >

32-bitReg

5-bitConst

32-bitReg

40-bitReg

< dst >

.L or .S


To move the content of a register (A or B) to another register (B or A) use the move “MV” Instruction, e.g.:

MV A0, B0

MV B6, B7

To move the content of a control register to another register (A or B) or vice-versa use the MVC instruction, e.g.:

MVC IFR, A0

MVC A0, IRP

Register to register data transfer


TMS320C6211/6711 Instruction Set


'C6211 Instruction Set (by category)

ArithmeticABSADDADDAADDKADD2MPYMPYHNEGSMPYSMPYHSADDSATSSUBSUBSUBASUBCSUB2

Program CtrlBIDLENOP

LogicalANDCMPEQCMPGTCMPLTNOTORSHLSHRSSHLXOR

Data Mgmt

LDB/H/WMVMVCMVKMVKLMVKHMVKLHSTB/H/W

Bit MgmtCLREXTLMBDNORMSET

Note: Refer to the 'C6000 CPU Reference Guide for more details.


'C6211 Instruction Set (by unit).S Unit

MVKLHNEGNOT ORSETSHLSHRSSHLSUBSUB2XORZERO

ADDADDKADD2ANDBCLREXTMVMVCMVKMVKLMVKH

.M Unit

SMPYSMPYH

MPYMPYH

.L Unit

NOTORSADDSATSSUBSUBSUBCXORZERO

ABSADDANDCMPEQCMPGTCMPLTLMBDMVNEGNORM

.D Unit

STB/H/WSUBSUBAZERO

ADDADDALDB/H/WMVNEGOther

IDLENOPNote: Refer to the 'C6000 CPU

Reference Guide for more details.


‘C6711 Additional Instructions (by unit)

.S Unit

CMPLTDPRCPSPRCPDPRSQRSPRSQRDPSPDP

ABSSPABSDPCMPGTSPCMPEQSPCMPLTSPCMPGTDPCMPEQDP.M Unit

MPYI MPYID

MPYSPMPYDP

.L Unit

INTSPINTSPUSPINTSPTRUNCSUBSPSUBDP

ADDDPADDSPDPINTDPSPINTDPINTDPU

.D Unit

ADDAD LDDW

Note: Refer to the 'C6000 CPU Reference Guide for more details.

‘C67x


TMS320C6211/6711 Memory Map


‘C6211 Memory MapByte Address

FFFF_FFFF

0000_000064K x 8 Internal

(L2 cache)

Internal Memory Unified (data or prog) 4 blocks - each can be

RAM or cache

On-chip Peripherals0180_0000

External Memory Async (SRAM, ROM, etc.)

Sync (SBSRAM, SDRAM)

256M x 8 External2

256M x 8 External3

8000_0000

9000_0000

A000_0000

B000_0000

256M x 8 External0

256M x 8 External1Level 1 Cache

4KB Program 4KB Data Not in map CPU L2

64K

4KP

4KD


TMS320C6211/6711 Peripherals


'C6x System Block Diagram

PERIPHERALS

Memory

Internal BusesExternalMemory

CPU

.D1

.M1

.L1

.S1

.D2

.M2

.L2

.S2

Regs (B

0-B15)

Regs (A

0-A15)

Control Regs


‘C6x Internal Buses

A

D

InternalMemory

x32AD

ExternalInterface

A

Dx32

Peripherals

can perform 64-bit data loads.‘C67x

Data Addr - T1 x32

Data Data - T1 x32/64

Data Addr - T2 x32

Data Data - T2 x32/64

Aregs

Bregs

Program Addr x32

Program Data x256PC

DMA Addr - Read x32

DMA Data - Read x32

DMA Addr - Write x32

DMA Data - Write x32

DMA


PERIPHERALS

Memory


Internal Buses

CPU

.D1

.M1

.L1

.S1

.D2

.M2

.L2

.S2

Regs (B

0-B15)

Regs (A

0-A15)

Control Regs

EMIF

Ext’lMemory

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002Chapter 1, Slide 119‘C6202

16M x 8 External0

Int’l Prog (64K instr)

Int’l Data (128K bytes)

On-chip Peripherals

4M x 8 External1

16M x 8 External2

16M x 8 External3

64K x 8 Internal

(L2 cache)

On-chip Peripherals

256M x 8 External2

256M x 8 External3

256M x 8 External0

256M x 8 External1

‘C6211

‘C6201/11 Memory Maps


PERIPHERALS


Internal Buses

CPU

.D1

.M1

.L1

.S1

.D2

.M2

.L2

.S2

Regs (B

0-B15)

Regs (A

0-A15)

Control Regs

EMIF

Ext’lMemory

- Sync

- Async

ProgramRAM Data Ram

Addr

D (32)


'C6x Peripherals

‘C6x

CPU

EMIF

DMA

Boot

ExternalMemory

EMIF (External Memory

Interface)

- Glueless access to async/sync

memory

EPROM, SRAM, SDRAM,

SBSRAM

DMA/EDMA (Enhance Direct

Memory Acces)

McBSP

HPI/XB

Timer

PLL

McBSP (Multi-Channel

Buffered Serial Port)

- High speed sync serial

comm

- T1/E1/MVIP interface

HPI (Host Port Interface)

/Expansion Bus (XB)

- 16/32-bit host P access


Clocking - Basic Definitions

What is a “clock cycle”?‘C6x‘C6x

CLKINCLKOUT2 (1/2 CLKOUT1)

CLKOUT1 (‘C6x clock cycle)PLL

x1x4

CLKIN - MHz PLL CLKOUT2 - MHz MIPs (max)

250 x1 125 2000

200 x1 100 1600

50 x4 100 1600

25 x4 50 800

CLKOUT1 - MHz

250 (4ns)

200 (5ns)

200

100 (10ns)

When we talk about cycles ...


'C6x System Block Diagram (Final)

Internal Buses

CPU

.D1

.M1

.L1

.S1

.D2

.M2

.L2

.S2

Regs (B

0-B15)

Regs (A

0-A15)

Control Regs

EMIF

Ext’lMemory

- Sync- Async

ProgramRAM

Data Ram

D (32)

Serial Port

Host Port

Boot Load

Timers

Pwr Down

DMA

Addr


ProgramMemory

DataMemory

L2 Memory

’C6201B 64KB1 blk Pgm/Cache

64KB2 blks4 banks ea

External

’C6701 64KB1 blk Pgm/Cache

64KB2 blks8 banks ea

External

’C6202 256KB1 blk Pgm/Cache1 blk Mapped Pgm

128 KB2 blks4 banks ea

External

’C6211/C6711 4 KB1 blk Cache

4 KB1 blk Cache

64 KB4 blk MappedCache

L1 Memory

Internal Memory Summary

TMS320C67x DSP GenerationParametric Table

TMS320C64x DSP Generation Parametric Table



‘C6000 Device Summary

Device MIPS MHz Kbytes pins mm W $ Periphs

6201B 1600 200 128 352 27 1.9 80-110 D2H

6202 2000 250 384 352 27 1.9 120-150 D3X

6211 1200 150 72 256 27 1.5 20-40 E2H

TMS320 MFLOPS MHz Kbytes pins mm W $ Periphs

6701 1000 167 128 352 35 1.9 170-200 D2H

6711 600 100 72 256 27 0.9 20-40 E2H

Peripherals Legend:D,E: DMA,EDMA2,3: # of McBSPs

H,X: HPI, XBUS


TMS320C67x DSP GenerationParametric Table



’C6000 History

6201 r1 1Q97 coincident ‘C6x architectural announcement. Sample CPU core, minimal peripherals.

6201 r2 4Q97. Full production, with peripherals.

6201B 4Q98. Power reduced, .18 micron silicon, double ports into internal data memory.

6701 3Q98. Pin-for-pin compatible floating-point version of ‘C6201. 1GFLOP (@ 167MHz) performance.

6202 2Q99. 2000 MIPS @ 250MHz. 2-3x 6201 on-chip memory.

Replaced HPI with Expansion Bus (32-bit HPI + more).

6211 3Q99. 2 cents per MIPS! 1200MIPS @ 150MHz as low as $25. Double-level cache, enhanced DMA.

6711 Announced 3/1/99. 6701 floating-point CPU with 6211-like memory/peripherals. Volume pricing under $20.


‘C6x Family Part Numbering

Example = TMS320LC6201PKGA200 TMS320 = TI DSP

L = Place holder for voltage levels

C6 = C6x family

2 = Fixed-point core

01 = Memory/peripheral configuration

PKG = Pkg designator (actual letters TBD)

A = -40 to 85C (blank for 0 to 70C)

200 = Core CPU speed in Mhz


Device Summary Table

Device Int Mem Ext Mem Peripherals

6201/6701 64K Data 3 x 16M DMA16K Instr 1 x 4M 2 McBSP

HPI (16-bit)2 Timer/Counters (32-bit)

6202 128K Data 3 x 16M DMA48K Instr 1 x 4M 2 McBSP

4 x 256M ---- XBus (32-bit)2 Timer/Counters (32-bit)

6211 4K Data Cache 4 x 256M EDMA4K Prog Cache 2 McBSP 64K RAM/Cache HPI (16-bit)

2 Timer/Counters (32-bit)


Module 1 Exam1. Functional Units

a. How many can perform an ADD? Name them.

b. Which support memory loads/stores?

.M .S .D .L

six; .L1, .L2, .D1, .D2, .S1, .S2

2. Memory Map

a. How many external ranges exist on ‘C6201?

Four


3. Conditional Codea. Which registers can be used as cond’l registers?

b. Which instructions can be conditional?

4. Performancea. What is the 'C6711 instruction cycle time?

b. How can the 'C6711 execute 1200 MIPs?

A1, A2, B0, B1, B2

All of them

CLKOUT1

1200 MIPs = 8 instructions (units) x 150 MHz


5. Coding Problems

a. Move contents of A0-->A1

MV .L1 A0, A1or ADD .S1 A0, 0, A1or MPY .M1 A0, 1, A1(what’s the problem

with this?)


5. Coding Problems

a. Move contents of A0-->A1

b. Move contents of CSR-->A1

c. Clear register A5

MV .L1 A0, A1or ADD .S1 A0, 0, A1or MPY .M1 A0, 1, A1

ZERO .S1 A5or SUB .L1 A5, A5, A5or MPY .M1 A5, 0, A5or CLR .S1 A5, 0, 31, A5or MVK .S1 0, A5or XOR .L1 A5,A5,A5

(A0 can only be a 16-bit value)

MVC CSR, A1


5. Coding Problems (cont’d)

d. A2 = A02 + A1

e. If (B1 0) then B2 = B5 * B6

f. A2 = A0 * A1 + 10

g. Load an unsigned constant (19ABCh) into register A6.

MPY.M1 A0, A0, A2ADD.L1 A2, A1, A2

[B1] MPY.M2 B5, B6, B2

mvkl .s1 0x00019abc,a6mvkh .s1 0x00019abc,a6

value .equ 0x00019abc

mvkl.s1 value,a6mvkh.s1 value,a6

MPY A0, A1, A2ADD 10, A2, A2


5. Coding Problems (cont’d)

h. Load A7 with contents of mem1 and post-increment the selected pointer.

x16 mem

mem1 10h

A7

load_mem1: MVKL .S1 mem1, A6

MVKH .S1 mem1, A6

LDH .D1 *A6++, A7


WEEK1

TMS320C6000 Introduction andArchitectural Overview

- End -