1/29
UTDSP: A VLIW Programmable DSP Processor
Sean Hsien-en Peng
Department of Electrical and Computer Engineering
University of Toronto
October 26th, 1999
2/29
Outline
- Background and Motivation: tightly-encoded vs. VLIW
- A Novel Long-Instruction Packing Scheme: better than the TI VelociTI packing scheme
- Architecture Simulator & GUI Debugger: minimize the design gap, optimize kernel code
- System Design and VLSI Implementation: design capture and a novel CAD flow
- Conclusions and Future Work
3/29
DSP: Need High Performance and Low Cost
- Specialized microprocessor for DSP algorithms
- Tightly-encoded ISA: small storage and memory bandwidth
  - Accumulator-based instructions
  - Not well suited to HLL compilers
- Alternative: VLIW architecture
  - Multiple FUs => high performance
  - Static scheduling => low cost
  - Easy target for compilers to exploit ILP
  - UTDSP uses a VLIW architecture
4/29
UTDSP System
5/29
Current Limitations of VLIW in Cost-Sensitive Systems
- Instruction memory size increases substantially
  - Unused encoding slots (NOPs)
  - Loop unrolling to exploit ILP
- Very high instruction-fetch bandwidth
  - A severe problem when off-chip instruction memory is used and pin count and packaging are major constraints
6/29
TI VelociTI Architecture
- Instruction packing reduces storage requirements
  - Unable to reduce fetch bandwidth
  - Use of off-chip memory degrades performance
  - Introduces crossbar delay
7/29
UTDSP Two-Level Instruction Fetching
- Reduces instruction bandwidth to 32 bits, enabling the use of off-chip instruction memory
- Minimizes decoder memory (90:10 rule)
- Eliminates the crossbar and extra decoding logic
- Allows a novel packing method that reduces storage requirements
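The two-level idea can be sketched in a few lines: off-chip program memory holds narrow per-cycle entries that index an on-chip decoder memory of unique wide VLIW words. This is a minimal illustrative model only — the slot count, operation names, and encoding below are assumptions, not the actual UTDSP format.

```python
# Hypothetical sketch of two-level instruction fetching. Off-chip program
# memory stores one 32-bit entry per cycle; the unique wide VLIW words live
# in a small on-chip decoder memory (the 90:10 rule keeps it small).

WIDE_SLOTS = 7  # one operation slot per functional unit (assumed)

def build_memories(program):
    """program: list of wide words (tuples of WIDE_SLOTS ops).
    Returns (program_mem of indices, decoder_mem of unique wide words)."""
    decoder_mem, index_of, program_mem = [], {}, []
    for word in program:
        if word not in index_of:              # store each unique wide word once
            index_of[word] = len(decoder_mem)
            decoder_mem.append(word)
        program_mem.append(index_of[word])    # narrow (32-bit) entry per cycle
    return program_mem, decoder_mem

def fetch(pc, program_mem, decoder_mem):
    """One narrow off-chip read selects a full wide word on chip."""
    return decoder_mem[program_mem[pc]]

# A tight kernel repeats the same few wide words many times:
kernel = [("ld", "ld", "nop", "addr", "addr", "mac", "nop"),
          ("ld", "ld", "nop", "addr", "addr", "mac", "st")]
program = kernel * 500                        # 1000 wide words in program order
pmem, dmem = build_memories(program)
print(len(pmem), len(dmem))                   # 1000 2
```

Only two wide words need on-chip storage for the thousand-cycle trace, while the per-cycle fetch stays at one narrow word.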
8/29
UTDSP Packing Method
9/29
Denser Packing using Two Clusters
10/29
Slot Sharing: Achieve Ideal Packing
- Identical operations share one decoder slot
- Further reduces the decoder memory of two-cluster packing by 10%
- Achieves a better-than-ideal packing result
  - Ideal: no NOPs in decoder memory
- No extra hardware
- The result can be improved further using clever register allocation
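The slot-sharing idea above can be illustrated with a toy packer. This is an assumed model for illustration, not the exact UTDSP hardware: decoder storage is kept per functional-unit column, NOPs occupy no storage (the "ideal" bound), and an operation repeated in the same column is stored once and shared (which is how the result beats that bound).

```python
# Illustrative sketch of slot sharing: per-column decoder tables where
# identical operations are stored once, and NOP slots cost nothing.

def pack_with_sharing(wide_words):
    """wide_words: list of tuples, one op (or None for NOP) per slot.
    Returns (encoded words as index tuples, total slots stored)."""
    n_slots = len(wide_words[0])
    columns = [dict() for _ in range(n_slots)]   # op -> index, per column
    encoded = []
    for word in wide_words:
        idxs = []
        for s, op in enumerate(word):
            if op is None:                       # NOP: no decoder storage
                idxs.append(None)
                continue
            if op not in columns[s]:
                columns[s][op] = len(columns[s])
            idxs.append(columns[s][op])          # shared entry for repeats
        encoded.append(tuple(idxs))
    storage = sum(len(c) for c in columns)
    return encoded, storage

words = [("add r3,r2,r1", None, "ld r4"),
         ("add r3,r2,r1", "mul r5,r6,r7", "ld r4"),  # reuses two slots
         (None, "mul r5,r6,r7", None)]
encoded, storage = pack_with_sharing(words)
print(storage)   # 3 unique operations stored, vs 9 raw encoding slots
```

Here even the NOP-free "ideal" would store 6 operations; sharing stores only the 3 distinct ones.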
11/29
Implementation of UTDSP Packer Software System
12/29
Packing Performance: UTDSP vs. VelociTI
- Better packing rate than the VelociTI algorithm
- Reduces instruction bandwidth to 32 bits, so off-chip instruction memory can be used
- No crossbar needed
- Only 27% storage overhead to obtain the speedup of the VLIW design
  - Speedup of 6 in hand-crafted kernels, 2.6 on average
[Bar chart: average storage requirement over 12 compiler-generated benchmarks, normalized to a Uni-OP encoding (1.0); the UTDSP Packer requires about 1.27 and VelociTI about 1.3]
13/29
Architecture Simulator and GUI Assembly Debugger
- The architecture simulator evaluates design trade-offs at an early stage
- The GUI-based assembly debugger helps assembly coding
14/29
OO Model Using Java: Minimize the Design Gap
- An object can model any class of automata in the Chomsky hierarchy
  - Data members model the state memory
  - Class methods model the state transitions
- Build the corresponding digital hardware by 1-to-1 mapping
  - Translate the object partition into a digital component partition
  - Translate the message protocol into a signal interface
- The two models end up with identical partitions and block interfaces
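The mapping described above can be made concrete with a minimal sketch (in Python here for brevity; the thesis simulator itself is in Java, and this counter example is assumed, not taken from it): the object's field is the state memory, and one method applies the state-transition function, mirroring a register plus next-state logic in hardware.

```python
# Minimal sketch of modeling a finite automaton as an object: the field is
# the state register, the method is the clocked state transition, and the
# method's arguments/return value play the role of the signal interface.

class ModuloCounter:
    """Models a mod-N counter; maps 1-to-1 to a register + increment logic."""

    def __init__(self, n):
        self.n = n
        self.count = 0            # data member = state memory (the register)

    def clock(self, enable):      # class method = state transition
        if enable:                # argument = input signal (enable pin)
            self.count = (self.count + 1) % self.n
        return self.count         # return value = output port

c = ModuloCounter(4)
outs = [c.clock(True) for _ in range(5)]
print(outs)   # [1, 2, 3, 0, 1]
```

Because the object boundary equals the hardware block boundary, the same partition and interface carry over when the model is translated to RTL.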
15/29
GUI-based Assembly Debugger is Free
- Adding self-displaying and event-listening abilities to the objects in the UTDSP OO model means the debugger need not be modified when the architecture simulator changes
16/29
UTDSP Hardware Design
17/29
Combine EX/MEM to Reduce RAW Hazards and Bypassing
- Bypassing is a costly function in VLIW
  - Complexity of the comparators and bypass buses is O(d*n^2), with n EUs and d = number of stages between ID and WB
- Combining EX/MEM reduces RAW hazards and bypassing hardware by 50%
  - All remaining RAW hazards can be resolved by bypassing
- Trade-off: eliminate the displacement addressing mode so pipeline latency does not increase
  - MUs initiate memory access at the very beginning of the EX stage
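The O(d*n^2) claim above is easy to verify with a back-of-the-envelope count (the two source operands per unit are an assumption for illustration): every source operand of every unit must be compared against the destination of every unit in every in-flight stage.

```python
# Back-of-the-envelope count of bypass comparators, showing the O(d * n^2)
# growth and the 50% saving from merging EX and MEM into one stage.

def bypass_comparators(n_units, d_stages, src_operands=2):
    """Each source operand of each unit is compared against the result of
    every unit in every pipeline stage between ID and WB."""
    return n_units * src_operands * n_units * d_stages

print(bypass_comparators(6, 2))   # 144 comparators with separate EX and MEM
print(bypass_comparators(6, 1))   # 72 after combining EX/MEM: a 50% saving
```

Halving d halves the comparator (and bypass-bus) count, which is exactly the 50% hardware reduction the slide cites.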
18/29
UTDSP Instruction Set: 69 Instructions
- Memory instructions (MU1, MU2)
  - Address-indirect addressing mode
  - MU1 is associated with data memory X, MU2 with data memory Y
- Address instructions (AU1, AU2)
  - Modulo addressing mode
  - A start/end register implementation reduces the number of ports on the register files
- Integer instructions (DU1, DU2)
  - MAC instructions for integer and 1.15 fixed-point formats
- Control instructions (PCU)
  - Branch, Jump, JSR (2-cycle branch delay)
  - Zero-overhead hardware loops (nested up to 5 levels)
    - Branches and interrupts allowed in the inner loops
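The modulo addressing mode above can be sketched as follows. This is an illustrative model under assumptions (the wrap arithmetic and the start/end register pair as the buffer bounds), not the documented UTDSP AU behavior:

```python
# Sketch of modulo (circular-buffer) post-increment addressing using a
# start/end register pair to bound the buffer.

def modulo_post_increment(addr, step, start, end):
    """Return (access_addr, next_addr); next_addr wraps within [start, end]."""
    nxt = addr + step
    if nxt > end:                  # stepped past the end register: wrap down
        nxt = start + (nxt - end - 1)
    elif nxt < start:              # stepped below the start register: wrap up
        nxt = end - (start - nxt - 1)
    return addr, nxt

# Walk a 4-entry circular buffer at addresses 100..103 six times:
a, seq = 100, []
for _ in range(6):
    cur, a = modulo_post_increment(a, 1, 100, 103)
    seq.append(cur)
print(seq)   # [100, 101, 102, 103, 100, 101]
```

Keeping the bounds in dedicated start/end registers means the general register file never needs extra read ports for the wrap check, which is the port-count saving the slide mentions.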
19/29
Optimize Kernel Performance Using Hardware Loops
- Zero-overhead, nested up to 5 levels, interruptible
- Example: for (i = 0; i < 1000; i++) { r3 = r2 + r1; r5 = r3 + r4; }
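The zero-overhead property can be made concrete with a toy sequencer model (the loop-register model and two-operation body below are assumptions for illustration, not the UTDSP ISA): the hardware repeats the body the requested number of times without spending any cycles on a test-and-branch.

```python
# Toy model of zero-overhead hardware-loop sequencing: loop start, end, and
# count live in dedicated registers, so the wrap-back costs zero cycles.

def run_hw_loop(body, count):
    """Execute `body` `count` times; only body operations consume cycles."""
    regs = {"r1": 1, "r2": 2, "r3": 0, "r4": 4, "r5": 0}
    cycles = 0
    for _ in range(count):              # hardware repeat: no test/branch ops
        for op in body:
            op(regs)
            cycles += 1
    return regs, cycles

body = [lambda r: r.update(r3=r["r2"] + r["r1"]),   # r3 = r2 + r1
        lambda r: r.update(r5=r["r3"] + r["r4"])]   # r5 = r3 + r4
regs, cycles = run_hw_loop(body, 1000)
print(regs["r5"], cycles)   # 7 2000
```

A software loop would add at least one decrement/compare and one branch per iteration (plus the 2-cycle branch delay), so the same kernel costs 2000 cycles here instead of 4000 or more.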
20/29
Design Challenge: The PC Unit
- Handles not only RISC tasks but also nestable, interruptible, zero-overhead hardware loops
21/29
VLSI Implementation: Challenges in Design Capture and CAD Methodology
- Design capture
  - Pipelined design, modeled in 10,100 lines of RTL VHDL
  - SRAM macros from CMC used to build the memory blocks
- Need for a novel hierarchical CAD flow
  - Interconnect delays decrease at the 50,000-gate block level as feature size shrinks
    - Need logic grouping and merging to partition the chip and localize critical paths
  - Minimize top-level routing between blocks
    - Need global pin optimization
  - Increase the routing utilization rate
    - Need a hierarchical P&R flow and an area-based router
22/29
Hierarchical CAD Flow
23/29
Logic Grouping and Merging
- The logic hierarchy may be a bad partition for P&R when minimizing interconnect delays and chip area
- Components on the critical path are scattered across many blocks; grouping and merging them localizes the critical path
24/29
Minimize Top-level Routing and Chip Area
- Use connectivity analysis to decide the floorplan
- Many iterations are needed to make the design pad-limited
25/29
Global Pin Optimization
- Arrange the pins of each block to minimize interconnect
- Place and route each block according to its pin locations
26/29
Area Savings Using the Hierarchical CAD Flow
- Reduces chip area by 36% (19 mm², about $38K)
- 50% utilization rate on average vs. 20% in a flat flow
27/29
Post-Layout Speed Results
[Bar chart: post-layout critical-path delay broken into interconnect, gate, and MAC components, under best-case and worst-case conditions]
- Full-chip R/C extraction obtains the interconnect delays
  - The maximum critical-path interconnect delay is 92 ps => P&R is doing a good job
- Back-annotating the interconnect delays gives the actual speed
  - 63 MHz under best-case conditions
  - 29 MHz under worst-case conditions
  - 30%-50% of yield using TSMC 0.35 um cells can achieve best-case timing
- The MAC accounts for 70% of the critical-path gate delay
  - A faster implementation is needed to improve speed
28/29
UTDSP Facts
|                                         | TI TMS320C6201 | Philips R.E.A.L. DSP | UTDSP                 |
|-----------------------------------------|----------------|----------------------|-----------------------|
| Cycle count for FIR (N taps, M samples) | M(N+8)/2 + 6   | M(N+9)/2 + 8         | M(N+6)/2 + 7          |
| Max clock speed                         | 167-200 MHz    | 85 MHz               | 63 MHz                |
| Number of FUs                           | 8              | 10                   | 7                     |
| Process tech                            | 0.25 um CMOS   | 0.25 um CMOS         | 0.35 um CMOS          |
| Deliverable form                        | 352-pin BGA    | RTL core             | 108-pin PGA, RTL core |
| Design methodology                      | Full-custom    | Synthesis            | Synthesis             |
29/29
Conclusions and Future Work
- Conclusions:
  - Designed and implemented a two-cluster instruction packer that outperforms the VelociTI packing scheme
  - Designed and implemented an OO architecture simulator and a GUI-based assembly debugger
  - Defined a novel hierarchical CAD flow
  - Presented the VLSI implementation of the UTDSP
- Future work:
  - Build register files using SRAM compilers
  - Implement in a CMOS technology with more metal layers
  - Completely eliminate the design gap using JavaToRTL
30/29
Why We Need an Architecture Simulator
31/29
Speedup Comparison: UTDSP vs. TI VelociTI