TMS320C64x

• TMS320C64x is a family of fixed-point Very Long Instruction Word (VLIW) DSPs from Texas Instruments.

• At clock rates of up to 1 GHz, C64x DSPs can process information at rates of up to 8000 MIPS.

• C64x DSPs can do more work each cycle thanks to built-in architectural extensions.

• They can run all C62x object code unmodified (but not vice versa).
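The 8000 MIPS figure is consistent with eight execution units each issuing one instruction per cycle at 1 GHz. A minimal sketch of that arithmetic follows; the function name is illustrative and not part of any TI library.

```c
#include <stdint.h>

/* Peak instruction rate in MIPS, assuming every execution unit
 * issues one instruction on every cycle (a best-case upper bound). */
uint32_t peak_mips(uint32_t clock_mhz, uint32_t units)
{
    return clock_mhz * units;
}
```

Calling peak_mips(1000, 8) gives 8000. Sustained throughput depends on how many units the code actually keeps busy.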

Applications for the C64x

TMS320C64x can be used as a CPU in the following devices:

Wireless local base stations;

Remote access server (RAS);

Digital subscriber loop (DSL) systems;

Cable modems;

Multichannel telephony systems;

Pooled modems;

New extensions

• Register file enhancements

• Data path extensions

• Packed data processing

• Additional functional unit hardware

• Increased orthogonality

Register file enhancements

• The ’C64x register file has twice as many general-purpose registers as the ’C62x/’C67x cores.

• There are 32 32-bit registers per data path: A0-A31 for file A and B0-B31 for file B.

• A0 may also be used as a condition register, bringing the total to six condition registers.

• In all ’C6000 devices, registers A4-A7 and B4-B7 can be used for circular addressing.

Packed data processing

• The ’C64x register file supports all the ’C62x data types and additionally supports packed 8-bit types and 64-bit fixed-point data types.

• Packed data types store either four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair.

• Besides being able to perform all the ’C62x instructions, the ’C64x also contains many 8-bit and 16-bit extensions to the instruction set.

For example, the MPYU4 instruction performs four 8x8 unsigned multiplies with a single instruction on a .M unit.
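To make the packed layout concrete, here is a portable C sketch that emulates a quad 8x8 unsigned multiply in the style of MPYU4. The function name and the use of a uint64_t to stand in for the 64-bit register pair are assumptions for illustration. On the DSP, the whole operation is a single instruction.

```c
#include <stdint.h>

/* Emulates a quad 8x8 unsigned multiply on packed data.
 * Byte lane k of a is multiplied by byte lane k of b, and the 16-bit
 * product is placed in 16-bit lane k of the 64-bit result. */
uint64_t mpyu4_emulated(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        /* x * y is at most 255 * 255 = 65025, so it fits in 16 bits. */
        result |= (uint64_t)(x * y) << (16 * lane);
    }
    return result;
}
```

The loop makes four passes here, whereas the hardware computes all four products in parallel.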

Data path extensions

• On the ’C64x, all eight of the functional units have access to the register file on the opposite side via a cross path.

• On the ’C62x/’C67x, only six functional units have access to the register file on the opposite side via a cross path; the .D units do not have a data cross path.

• The ’C64x pipelines data cross path accesses, allowing multiple units per side to read the same cross path source simultaneously.

• In the ’C62x/’C67x, only one functional unit per data path per execute packet can get an operand from the opposite register file.

Additional Functional Unit Hardware

• The .L units can perform byte shifts and the .M units can perform bi-directional variable shifts, in addition to the .S units’ ability to do shifts.

• Bit-count and rotate hardware on the .M unit extends support for bit-level algorithms such as binary morphology, image metric calculations and encryption algorithms.

Increased Orthogonality

• The .D unit can now perform 32-bit logical instructions, in addition to the .S and .L units.

• The .D unit also now directly supports load and store instructions for double-word data values.

Block diagram

[Figure: C64x block diagram. Labelled blocks: CPU core (C64x CPU); L1 program cache (direct-mapped, 16 KBytes total); L1 data cache (2-way set-associative, 16 KBytes total); L2 memory (1024 KBytes); enhanced DMA controller (64-channel); EMIF A and EMIF B; external ZBT RAM, SDRAM, SBSRAM, FIFO, SRAM and I/O devices.]

Architecture Overview

• 2 (almost) identical fixed-point data paths that each contain:
  – 1 ALU (the .L unit)
  – 1 shifter (the .S unit)
  – 1 multiplier (the .M unit)
  – 1 adder/subtractor used for address generation (the .D unit)
  – 1 register file containing thirty-two 32-bit registers

• The 8 execution units in the 2 data paths are capable of executing up to 8 instructions in parallel.

• Can operate on 8-, 16-, 32-, and 40-bit data

• Can perform double-word (64-bit) loads and stores by using 2 registers for the one operation.

General-Purpose Register Files

Each C64x register file contains 32 32-bit registers (A0-A31 for file A and B0-B31 for file B), which can be used for data, pointers or conditions.

Values larger than 32 bits (40-bit long and 64-bit float quantities) are stored in register pairs.

Packed data types are four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair.

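[Figure: storage of a 40-bit long value in a register pair. The even register holds bits 31-0; the odd register holds bits 39-32, and its remaining bits are zero filled.]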
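The register-pair layout for a 40-bit value can be sketched in C as follows. The struct and function names are hypothetical. The sketch also assumes the unsigned, zero-filled layout described above, with the odd register holding the upper 8 bits.

```c
#include <stdint.h>

/* Models an odd:even register pair holding a 40-bit value. */
typedef struct {
    uint32_t odd;   /* bits 39..32 in the low byte, upper 24 bits zero */
    uint32_t even;  /* bits 31..0 */
} reg_pair;

/* Split a 40-bit value (held in a uint64_t) across the pair. */
reg_pair split_long40(uint64_t value)
{
    reg_pair p;
    p.even = (uint32_t)(value & 0xFFFFFFFFu);
    p.odd  = (uint32_t)((value >> 32) & 0xFFu);
    return p;
}

/* Reassemble the 40-bit value from the pair. */
uint64_t join_long40(reg_pair p)
{
    return ((uint64_t)(p.odd & 0xFFu) << 32) | p.even;
}
```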

Delay Slots

• Delay slots mean “how many CPU cycles come between the current instruction and when the results of the instruction can be used by another instruction”

• Single-cycle instructions: 0 delay slots

• 16x16 single multiply and .M unit non-multiply instructions: 1 delay slot

• Store: 0 delay slots
  – If a load occurs before a store (either in parallel or not), then the old data is loaded from memory before the new data is stored.
  – If a load occurs after a store (either in parallel or not), then the new data is stored before the data is loaded.

• C64x multiply extensions: 3 delay slots

• Load: 4 delay slots

• Branch: 5 delay slots
  – The branch target is in the PG slot when the branch condition is determined in E1. There are 5 slots between PG and E1 when the branch target begins executing useful code again.
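The delay-slot counts listed above can be turned into a small scheduling helper. This is a sketch only: the enum and function names are hypothetical, and the model considers result latency alone, not functional-unit or cross-path conflicts.

```c
/* Instruction classes and delay-slot counts as listed above. */
typedef enum {
    OP_SINGLE_CYCLE,
    OP_MPY_16X16,
    OP_STORE,
    OP_MPY_EXTENSION,
    OP_LOAD,
    OP_BRANCH
} op_class;

int delay_slots(op_class op)
{
    switch (op) {
    case OP_SINGLE_CYCLE:  return 0;
    case OP_MPY_16X16:     return 1;
    case OP_STORE:         return 0;
    case OP_MPY_EXTENSION: return 3;
    case OP_LOAD:          return 4;
    case OP_BRANCH:        return 5;
    }
    return -1; /* unknown class */
}

/* First cycle on which a dependent instruction can use the result
 * of an instruction issued on issue_cycle. */
int first_use_cycle(int issue_cycle, op_class op)
{
    return issue_cycle + delay_slots(op) + 1;
}
```

For example, the result of a load issued on cycle 0 is first usable on cycle 5. The four cycles in between are the delay slots, which a scheduler tries to fill with independent work.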

Memory

The C64x has separate spaces for program and data memory;

Uses a two-level cache memory scheme;

Internal Memory

The C64x has a 32-bit byte-addressable memory with the following features:

Separate data and program address spaces;

Large on chip RAM, up to 7MB;

2-level cache;

Single internal program memory port with an instruction-fetch bandwidth of 256 bits;

Two 64-bit internal data memory ports;

Memory Map (Internal and External Memory)

• Level 1 program cache is 128 Kbit, direct-mapped

• Level 1 data cache is 128 Kbit, 2-way set-associative

• Shared Level 2 program/data memory/cache of 4 Mbit, which can be configured as:
  – Mapped memory
  – Cache (up to 256 KBytes)
  – A combination of the two
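The sizes in this list are given in bits. A small helper (illustrative, not from any TI header) converts them to bytes: 128 Kbit is 16 KBytes and 4 Mbit is 512 KBytes.

```c
#include <stdint.h>

/* Convert a memory size in Kbit to KBytes (8 bits per byte). */
uint32_t kbit_to_kbytes(uint32_t kbit)
{
    return kbit / 8;
}
```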

Memory Buses

• Instruction fetch using 32-bit address bus and 256-bit data bus

• Two 64-bit load buses (LD1 and LD2)

• Two 64-bit store buses (ST1 and ST2)
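The bus widths translate directly into per-cycle capacity. The sketch below uses the fact that C6000 instructions are 32 bits wide, so a 256-bit fetch delivers eight instructions; the function names are illustrative only.

```c
#include <stdint.h>

enum { INSTRUCTION_BITS = 32 };  /* width of one C6000 instruction */

/* Number of instructions delivered by one fetch over the program bus. */
uint32_t instructions_per_fetch(uint32_t fetch_bus_bits)
{
    return fetch_bus_bits / INSTRUCTION_BITS;
}

/* Peak bytes moved per cycle over a group of identical data buses. */
uint32_t bytes_per_cycle(uint32_t bus_count, uint32_t bus_bits)
{
    return bus_count * bus_bits / 8;
}
```

With the figures above, instructions_per_fetch(256) is 8, matching the eight execution units. bytes_per_cycle(2, 64) is 16 for the load buses, and the same for the store buses.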

Interrupts

• 16 prioritized interrupts: INT_00 to INT_15

• INT_00 has the highest priority and is dedicated to RESET. This halts the CPU and returns it to a known state.

• The first four interrupts (INT_00 – INT_03) are fixed and non-maskable.

• INT_01 – INT_03 are generally used to alert the CPU of an impending hardware problem, such as an imminent power failure.

• The remaining interrupts are maskable and can be programmed.
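The priority and masking rules above can be sketched as a small selection function. This is a simplified model and not the hardware's actual register interface. The bit-mask representation, the function name, and the assumption that a lower-numbered interrupt always takes priority are illustrative.

```c
#include <stdint.h>

#define NONMASKABLE_MASK 0x000Fu  /* INT_00..INT_03 cannot be masked */

/* Returns the number of the highest-priority interrupt to service,
 * or -1 if none is eligible. Bit n of each mask corresponds to INT_n.
 * Assumes a lower interrupt number means a higher priority. */
int next_interrupt(uint16_t pending, uint16_t enabled)
{
    uint16_t active = pending & (uint16_t)(enabled | NONMASKABLE_MASK);
    for (int n = 0; n < 16; n++) {
        if (active & (1u << n)) {
            return n;
        }
    }
    return -1;
}
```

With INT_04 and INT_05 pending but only INT_05 enabled, the function returns 5. A pending INT_01 is returned even when the enable mask is zero.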

Interrupt Performance Consideration

• Overhead for all CPU interrupts is 7 cycles

• Interrupt latency is 11 cycles

• Interrupts can be recognized every 2 cycles

• 2 occurrences of a specific interrupt can be recognized in 2 cycles

Peripheral Set

• 2 multichannel buffered audio serial ports

• 2 inter-integrated circuit bus modules (I2Cs)

• 3 multichannel buffered serial ports (McBSPs)

• 3 32-bit general-purpose timers

• 1 user-configurable 16-bit or 32-bit host-port interface (HPI16/HPI32)

• 1 16-pin general-purpose input/output port (GP0) with programmable interrupt/event generation modes

• 1 32-bit glueless external memory interface (EMIFA), capable of interfacing to synchronous and asynchronous memories and peripherals

ZBT RAM

• Zero Bus Turnaround (ZBT) is a synchronous SRAM architecture optimized for networking and telecommunications applications.

• It can increase the internal bandwidth of a switch fabric when compared to standard SyncBurst SRAM.

• The ZBT architecture is optimized for switching and other applications with highly random READs and WRITEs.

• ZBT SRAMs eliminate all idle cycles when turning the data bus around from a WRITE operation to a READ operation

Packaging – Top View

[Figure: package top view, with the A1 corner marked.]

Packaging – Bottom View

[Figure: package bottom view.]

Sum of products example

C code:

int DotP(short* m, short* n, int count)
{
    int i, product, sum = 0;
    for (i = 0; i < count; i++) {
        product = m[i] * n[i];
        sum += product;
    }
    return sum;
}

TI TMS C64x code:

LOOP:

[A0] SUB .L1 A0, 1, A0

|| [!A0] ADD .S1 A6, A5, A5

|| MPY .M1X B4, A4, A6

|| [B0] BDEC .S2 LOOP, B0

LDH .D1T1 *A3++, A4

LDH .D2T2 *B5++, B4

Another code example

MIPS:

loop: LW   R1, 0(R11)
      MUL  R2, R1, R10
      SW   R2, 0(R12)
      ADDI R12, R12, #-4
      ADDI R11, R11, #-4
      BGTZ R12, loop

TI TMS C64x:

ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MVK .S2 #-4,B1

ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || ADDK .S2 #-12,B12

loop: ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12) ||

ADD .L2 B12,B1,B12 || BGTZ .S2 B12, loop

ADD .L2 B12, B1, B12 || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12)

ADD .L2 B12, B1, B12 || STW .D2x A2,0(B12)

Special purpose instructions

Example Application     Description                           Instruction
Machine vision          Bit counter                           BITC4
Reed Solomon support    Galois Field MPY                      GMPY4
Convolution encoder     Bit interleaving                      SHFL
Cable modem             Bit de-interleaving                   DEAL
Endian swap             Byte swap                             SWAP4
Graphics                Bit expansion                         XPNDx
Audio                   Extended precision 16x32 MPYs         MPYHIx, MPYLIx
Motion compensation     Quad 8-bit, Dual 16-bit average       AVGx
Motion estimation       Quad 8-bit Absolute of differences    SUBABS4
GSM                     Signed variable shift                 SSHVL, SSHVR
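To show what two of these instructions compute, the following portable C sketch emulates a quad 8-bit absolute difference (in the style of SUBABS4) and a per-byte bit count (in the style of BITC4). The function names are hypothetical, and the bytes are treated as unsigned. On the DSP, each operation is a single instruction.

```c
#include <stdint.h>

/* Quad 8-bit absolute difference on packed unsigned bytes:
 * result lane k = |a lane k - b lane k|. */
uint32_t subabs4_emulated(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int x = (int)((a >> (8 * lane)) & 0xFFu);
        int y = (int)((b >> (8 * lane)) & 0xFFu);
        uint32_t diff = (uint32_t)(x > y ? x - y : y - x);
        result |= diff << (8 * lane);
    }
    return result;
}

/* Per-byte bit count: result lane k = number of set bits in a lane k. */
uint32_t bitc4_emulated(uint32_t a)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t byte = (a >> (8 * lane)) & 0xFFu;
        uint32_t count = 0;
        while (byte) {
            count += byte & 1u;
            byte >>= 1;
        }
        result |= count << (8 * lane);
    }
    return result;
}
```

In motion estimation, summing the four lanes of the absolute difference over a block gives a sum-of-absolute-differences score. This is why the table pairs SUBABS4 with that application.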

THE END
