TMS320c64x features
TMS320C64x
• TMS320C64x is a family of 32-bit fixed-point Very Long Instruction Word (VLIW) DSPs from Texas Instruments
• At clock rates of up to 1 GHz, C64x DSPs can process information at rates up to 8000 MIPS
• Built-in extensions (packed-data instructions and additional functional-unit hardware) let C64x DSPs do more work each cycle
• They run all C62x object code unmodified (but not vice versa)
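The 8000 MIPS figure above follows from simple arithmetic: clock rate times instructions issued per cycle. A minimal sketch, assuming the peak of eight instructions per cycle (one per functional unit) described later in this deck; the function name `peak_mips` is illustrative only.

```c
#include <stdint.h>

/* Peak instruction rate in MIPS.
 * clock_mhz:       CPU clock in MHz (1 GHz = 1000 MHz)
 * instr_per_cycle: instructions issued per cycle (assumed 8 at peak) */
uint32_t peak_mips(uint32_t clock_mhz, uint32_t instr_per_cycle)
{
    return clock_mhz * instr_per_cycle;
}
```

Calling `peak_mips(1000, 8)` reproduces the quoted peak; real code rarely fills all eight slots every cycle, so sustained rates are lower.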
Applications for the C64x
TMS320C64x can be used as a CPU in the following devices:
Wireless local base stations;
Remote access servers (RAS);
Digital subscriber loop (DSL) systems;
Cable modems;
Multichannel telephony systems;
Pooled modems.
New extensions
• Register file enhancements
• Data path extensions
• Packed data processing
• Additional functional unit hardware
• Increased orthogonality
Register file enhancements
• The ’C64x register file has twice as many general-purpose registers as the ’C62x/’C67x cores
• There are 32 32-bit registers per data path: A0-A31 for file A and B0-B31 for file B
• A0 may also be used as a condition register, bringing the total to six condition registers.
• In all ’C6000 devices, registers A4-A7 and B4-B7 can be used for circular addressing.
Packed data processing
• The ’C64x register file supports all the ’C62x data types and extends this by additionally supporting packed 8-bit types and 64-bit fixed-point data types.
• Packed data types store either four 8-bit values or two 16-bit values in a single 32-bit register or four 16-bit values in a 64-bit register pair.
• Besides supporting all the ’C62x instructions, the ’C64x adds many 8-bit and 16-bit extensions to the instruction set.
E.g., the MPYU4 instruction performs four 8x8 unsigned multiplies with a single instruction on a .M unit.
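To make the packed multiply concrete, the following is a portable C model of what a four-way 8x8 unsigned multiply computes. It is a sketch, not TI code: the name `mpyu4_model` is made up, and the layout (each 16-bit product placed in the lane matching its source bytes, across a 64-bit result standing in for a register pair) is an assumption based on the packed data types described above.

```c
#include <stdint.h>

/* Model of a quad 8x8 unsigned multiply on packed data.
 * a, b: four unsigned 8-bit values packed in each 32-bit word.
 * Returns four 16-bit products packed in a 64-bit value
 * (standing in for a 64-bit register pair). */
uint64_t mpyu4_model(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;  /* byte from a */
        uint32_t y = (b >> (8 * lane)) & 0xFFu;  /* byte from b */
        /* 255 * 255 = 65025 fits in 16 bits, so lanes never overlap */
        result |= (uint64_t)(x * y) << (16 * lane);
    }
    return result;
}
```

On the DSP the four products come from one instruction on a .M unit; the loop here only spells out the per-lane arithmetic.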
Data path extensions
• On the ’C64x, all eight of the functional units have access to the register file on the opposite side via a cross path.
• On the ’C62x/’C67x, only six functional units have access to the register file on the opposite side via a cross path; the .D units do not have a data cross path.
• The ’C64x pipelines data cross path accesses, allowing multiple units per side to read the same cross path source simultaneously.
• On the ’C62x/’C67x, only one functional unit per data path per execute packet can get an operand from the opposite register file.
Additional Functional Unit Hardware
• The .L units can perform byte shifts and the .M units can perform bi-directional variable shifts, in addition to the .S unit’s ability to do shifts.
• Bit-count and rotate hardware on the .M unit extends support for bit-level algorithms such as binary morphology, image metric calculations and encryption algorithms.
Increased Orthogonality
• The .D unit can now perform 32-bit logical instructions in addition to the .S and .L units.
• Also, the .D unit now directly supports load and store instructions for double-word data values
Block diagram
[Figure: C64x block diagram. Blocks shown: CPU core (C64x CPU); L1 program cache (direct-mapped, 16 Kbytes total); L1 data cache (2-way set-associative, 16 Kbytes total); L2 memory (1024 Kbytes); enhanced DMA controller (64-channel); EMIF A and EMIF B; external ZBT RAM, SDRAM, SBSRAM, FIFO, SRAM and I/O devices.]
Architecture Overview
• 2 (almost) identical fixed-point data paths that each contain
– 1 ALU (the .L unit)
– 1 shifter (the .S unit)
– 1 multiplier (the .M unit)
– 1 adder/subtractor used for address generation (the .D unit)
– 1 register file containing thirty-two 32-bit registers
• The 8 execution units in the 2 data paths are capable of executing up to 8 instructions in parallel.
• Can operate on 8-, 16-, 32-, and 40-bit data
• Can perform double-word (64-bit) loads and stores by using 2 registers for the one operation.
General-Purpose Register Files
Each C64x register file contains 32 32-bit registers (A0-A31 for file A and B0-B31 for file B);
the registers can be used for data, pointers or conditions.
Values larger than 32 bits (40-bit long and 64-bit quantities) are stored in register pairs.
Packed data types are: four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair.
[Figure: 40-bit value in a register pair. The odd register holds bits 39-32, with its upper bits zero filled; the even register holds bits 31-0.]
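The register-pair layout for a 40-bit value can be sketched in C as follows. This is an illustrative model only: the `reg_pair_t` type and both function names are hypothetical, and the layout (low 32 bits in the even register, bits 39-32 in the odd register with the remaining bits zero filled) is taken from the figure above.

```c
#include <stdint.h>

/* Hypothetical model of an odd:even register pair. */
typedef struct {
    uint32_t odd;   /* bits 39-32 of the value, upper 24 bits zero */
    uint32_t even;  /* bits 31-0 of the value */
} reg_pair_t;

/* Split a 40-bit value across the pair; bits above 39 are discarded. */
reg_pair_t store_long40(uint64_t value)
{
    reg_pair_t p;
    p.even = (uint32_t)(value & 0xFFFFFFFFu);
    p.odd  = (uint32_t)((value >> 32) & 0xFFu);  /* zero filled above bit 7 */
    return p;
}

/* Reassemble the 40-bit value; only the low 8 bits of odd are used. */
uint64_t load_long40(reg_pair_t p)
{
    return ((uint64_t)(p.odd & 0xFFu) << 32) | p.even;
}
```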
Delay Slots
• Delay slots mean “how many CPU cycles come between the current instruction and when the results of the instruction can be used by another instruction”
• Single-cycle instructions: 0 delay slots
• 16x16 single multiply and .M unit non-multiply instructions: 1 delay slot
• Store: 0 delay slots
– If a load occurs before a store (either in parallel or not), then the old data is loaded from memory before the new data is stored.
– If a load occurs after a store (either in parallel or not), then the new data is stored before the data is loaded.
• C64x multiply extensions: 3 delay slots
• Load: 4 delay slots
• Branch: 5 delay slots
– The branch target is in the PG stage when the branch condition is determined in E1, so 5 delay slots pass between the branch and the point where the branch target reaches E1 and begins executing useful code again.
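The delay-slot counts above translate directly into scheduling arithmetic: an instruction issued in cycle N with D delay slots has its result usable by an instruction issued in cycle N + D + 1. A small sketch of that rule, where the enum and function names are invented for illustration and the counts are the ones listed on this slide:

```c
/* Instruction classes from the delay-slot list (names are illustrative). */
typedef enum {
    OP_SINGLE_CYCLE,  /* single-cycle instructions */
    OP_MPY16,         /* 16x16 multiply, .M unit non-multiply */
    OP_STORE,
    OP_MPY_EXT,       /* C64x multiply extensions */
    OP_LOAD,
    OP_BRANCH
} op_class_t;

/* Delay slots for each class, as given on the slide. */
int delay_slots(op_class_t op)
{
    switch (op) {
    case OP_SINGLE_CYCLE: return 0;
    case OP_MPY16:        return 1;
    case OP_STORE:        return 0;
    case OP_MPY_EXT:      return 3;
    case OP_LOAD:         return 4;
    case OP_BRANCH:       return 5;
    }
    return -1;  /* unknown class */
}

/* First cycle in which a dependent instruction can be issued. */
int first_use_cycle(int issue_cycle, op_class_t op)
{
    return issue_cycle + delay_slots(op) + 1;
}
```

For example, a load issued in cycle 0 feeds a consumer no earlier than cycle 5, which is why hand-scheduled loops issue loads several packets ahead of the multiplies that use them.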
Memory
The C64x has separate spaces for program and data memory;
It uses a two-level cache memory scheme.
Internal Memory
The C64x has a 32-bit, byte-addressable address space with the following features:
Separate data and program address spaces;
Large on-chip RAM, up to 7MB;
2-level cache;
Single internal program memory port with an instruction-fetch bandwidth of 256 bits;
Two 64-bit internal data memory ports.
Memory Map (Internal and External Memory)
• Level 1 program cache is 128 Kbit (16 Kbytes), direct-mapped
• Level 1 data cache is 128 Kbit (16 Kbytes), 2-way set-associative
• Shared Level 2 program/data memory/cache of 4 Mbit
– Can be configured as mapped memory
– Cache (up to 256 Kbytes)
– Combination of the two
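The cache figures above can be turned into set counts and set indices with a few lines of arithmetic. This is a rough sketch: the function names are made up, and the line sizes used in the usage note (32 bytes for L1P, 64 bytes for L1D) are assumptions that the slides do not state, so check them against the device documentation.

```c
#include <stdint.h>

/* Convert a size in Kbit to bytes (128 Kbit -> 16384 bytes). */
uint32_t kbit_to_bytes(uint32_t kbit)
{
    return kbit * 1024u / 8u;
}

/* Number of sets: capacity divided by (line size x associativity).
 * ways = 1 describes a direct-mapped cache. */
uint32_t cache_num_sets(uint32_t size_bytes, uint32_t line_bytes, uint32_t ways)
{
    return size_bytes / (line_bytes * ways);
}

/* Set that a byte address maps to. */
uint32_t cache_set_index(uint32_t addr, uint32_t size_bytes,
                         uint32_t line_bytes, uint32_t ways)
{
    return (addr / line_bytes) % cache_num_sets(size_bytes, line_bytes, ways);
}
```

Under the assumed line sizes, `cache_num_sets(16384, 32, 1)` gives 512 sets for the direct-mapped program cache and `cache_num_sets(16384, 64, 2)` gives 128 sets for the 2-way data cache. In the direct-mapped case, two addresses exactly 16 Kbytes apart land in the same set and evict each other.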
Memory Buses
• Instruction fetch using 32-bit address bus and 256-bit data bus
• Two 64-bit load buses (LD1 and LD2)
• Two 64-bit store buses (ST1 and ST2)
Interrupts
• 16 prioritized interrupts: INT_00 to INT_15
• INT_00 has the highest priority and is dedicated to RESET. This halts the CPU and returns it to a known state
• The first four interrupts (INT_00 – INT_03) are fixed and non-maskable
• INT_01 – INT_03 are generally used to alert the CPU of an impending hardware problem, such as an imminent power failure
• The remaining interrupts are maskable and can be programmed
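The priority and masking rules above can be modeled as a bit-scan over a pending mask. The sketch below is a simplified software model, not the CPU's actual interrupt logic: `next_interrupt`, the `pending`/`enabled` bit masks and `NONMASKABLE_MASK` are hypothetical names, with bit n standing for INT_n.

```c
#include <stdint.h>

#define NUM_INTERRUPTS   16
#define NONMASKABLE_MASK 0x000Fu  /* INT_00..INT_03 cannot be masked */

/* Return the number of the highest-priority interrupt to service,
 * or -1 if none is eligible. Lower numbers have higher priority.
 * pending: bit n set means INT_n is waiting.
 * enabled: bit n set means maskable INT_n is enabled. */
int next_interrupt(uint16_t pending, uint16_t enabled)
{
    /* Non-maskable interrupts are eligible regardless of the enable bits. */
    uint16_t eligible = (uint16_t)(pending & (enabled | NONMASKABLE_MASK));

    for (int n = 0; n < NUM_INTERRUPTS; n++) {
        if (eligible & (1u << n)) {
            return n;
        }
    }
    return -1;
}
```

With INT_04 and INT_05 both pending but only INT_05 enabled, the model returns 5; a pending INT_01 wins even when every maskable interrupt is disabled.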
Interrupt Performance Consideration
• Overhead for all CPU interrupts is 7 cycles
• Interrupt latency is 11 cycles
• Interrupts can be recognized every 2 cycles
• 2 occurrences of a specific interrupt can be recognized in 2 cycles
Peripheral Set
• 2 multichannel buffered audio serial ports
• 2 inter-integrated circuit bus modules (I2Cs)
• 3 multichannel buffered serial ports (McBSPs)
• 3 32-bit general-purpose timers
• 1 user-configurable 16-bit or 32-bit host-port interface (HPI16/HPI32)
• 1 16-pin general-purpose input/output port (GP0) with programmable interrupt/event generation modes
• 1 32-bit glueless external memory interface (EMIFA), capable of interfacing to synchronous and asynchronous memories and peripherals.
ZBT RAM
• Zero Bus Turnaround (ZBT) is a synchronous SRAM architecture optimized for networking and telecommunications applications.
• It can increase the internal bandwidth of a switch fabric when compared to standard SyncBurst SRAM.
• The ZBT architecture is optimized for switching and other applications with highly random READs and WRITEs.
• ZBT SRAMs eliminate all idle cycles when turning the data bus around from a WRITE operation to a READ operation
Packaging – Top View
[Figure: package top view, with the A1 corner marked.]
Packaging – Bottom View
[Figure: package bottom view.]
Sum of products example
C code:
    /* Dot product of two arrays of 16-bit values. */
    int DotP(short* m, short* n, int count)
    {
        int i, product, sum = 0;
        for (i = 0; i < count; i++) {
            product = m[i] * n[i];
            sum += product;
        }
        return sum;
    }
TI TMS C64x code:
LOOP:
    [A0]  SUB  .L1   A0, 1, A0
 || [!A0] ADD  .S1   A6, A5, A5
 ||       MPY  .M1X  B4, A4, A6
 || [B0]  BDEC .S2   LOOP, B0
          LDH  .D1T1 *A3++, A4
          LDH  .D2T2 *B5++, B4
Another code example
MIPS:
loop: LW   R1, 0(R11)
      MUL  R2, R1, R10
      SW   R2, 0(R12)
      ADDI R12, R12, #-4
      ADDI R11, R11, #-4
      BGTZ R12, loop
TI TMS C64x:
ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MVK .S2 #-4,B1
ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || ADDK .S2 #-12,B12
loop: ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12) ||
ADD .L2 B12,B1,B12 || BGTZ .S2 B12, loop
ADD .L2 B12, B1, B12 || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12)
ADD .L2 B12, B1, B12 || STW .D2x A2,0(B12)
Special purpose instructions
Example Application     Description                            Instruction
GSM                     Signed variable shift                  SSHVL, SSHVR
Motion estimation       Quad 8-bit absolute of differences     SUBABS4
Motion compensation     Quad 8-bit, dual 16-bit average        AVGx
Audio                   Extended-precision 16x32 MPYs          MPYHIx, MPYLIx
Graphics                Bit expansion                          XPNDx
Endian swap             Byte swap                              SWAP4
Cable modem             Bit de-interleaving                    DEAL
Convolution encoder     Bit interleaving                       SHFL
Reed Solomon support    Galois field MPY                       GMPY4
Machine vision          Bit counter                            BITC4
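Two of the instructions listed here are easy to model in portable C, which helps show why they suit motion estimation and machine vision. The functions below are illustrative models with made-up names, assuming the operands are four unsigned 8-bit values packed in a 32-bit word and that each result byte stays in its own lane; they are not TI intrinsics.

```c
#include <stdint.h>

/* Model of a quad 8-bit absolute difference (as in SUBABS4):
 * each result byte is |a_byte - b_byte| for the matching lane. */
uint32_t subabs4_model(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int x = (int)((a >> (8 * lane)) & 0xFFu);
        int y = (int)((b >> (8 * lane)) & 0xFFu);
        int diff = (x > y) ? (x - y) : (y - x);
        result |= (uint32_t)diff << (8 * lane);
    }
    return result;
}

/* Model of a quad bit count (as in BITC4):
 * each result byte holds the number of 1 bits in the matching input byte. */
uint32_t bitc4_model(uint32_t a)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t byte = (a >> (8 * lane)) & 0xFFu;
        uint32_t count = 0;
        while (byte != 0) {
            count += byte & 1u;
            byte >>= 1;
        }
        result |= count << (8 * lane);
    }
    return result;
}
```

Summing the four bytes returned by `subabs4_model` over a block of pixels gives a sum of absolute differences, the usual block-matching cost in motion estimation.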
THE END