1/29
UTDSP: A VLIW Programmable DSP Processor
Sean Hsien-en Peng
Department of Electrical and Computer Engineering
University of Toronto
October 26th, 1999
2/29
Outline
- Background and Motivation: tightly-encoded vs. VLIW
- A Novel Long-Instruction Packing Scheme: better than the TI VelociTI packing scheme
- Architecture Simulator & GUI Debugger: minimize the design gap, optimize kernel code
- System Design and VLSI Implementation: design capture and a novel CAD flow
- Conclusions and Future Work
3/29
DSP: Need High Performance and Low Cost
- Specialized microprocessor for DSP algorithms
- Tightly-encoded ISA: small storage and memory bandwidth
  - Accumulator-based instructions
  - Not well suited to HLL compilers
- Alternative: VLIW architecture
  - Multiple FUs => high performance
  - Static scheduling => low cost
  - Easy target for compilers to exploit ILP
  - UTDSP uses a VLIW architecture
4/29
UTDSP System
5/29
Current Limitations of VLIW in Cost-Sensitive Systems
- Instruction memory size increases substantially
  - Unused encoding slots (NOPs)
  - Loop unrolling to exploit ILP
- Very high instruction-fetch bandwidth
  - A severe problem when off-chip instruction memory is used and pin count and packaging are major constraints
6/29
TI VelociTI Architecture
- Instruction packing reduces storage requirements
  - Unable to reduce fetch bandwidth
  - Use of off-chip memory degrades performance
  - Introduces crossbar delay
7/29
UTDSP Two-Level Instruction Fetching
- Reduces instruction bandwidth to 32 bits, enabling the use of off-chip instruction memory
- Minimizes decoder memory (90:10 rule)
- Eliminates the crossbar and extra decoding logic
- Allows a novel packing method that reduces storage requirements
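The two-level idea can be sketched in a few lines: off-chip program memory holds narrow per-cycle entries that index an on-chip decoder memory of unique wide VLIW words. This is a minimal illustrative model only — the slot count, operation names, and encoding below are assumptions, not the actual UTDSP format.

```python
# Hypothetical sketch of two-level instruction fetching. Off-chip program
# memory stores one 32-bit entry per cycle; the unique wide VLIW words live
# in a small on-chip decoder memory (the 90:10 rule keeps it small).

WIDE_SLOTS = 7  # one operation slot per functional unit (assumed)

def build_memories(program):
    """program: list of wide words (tuples of WIDE_SLOTS ops).
    Returns (program_mem of indices, decoder_mem of unique wide words)."""
    decoder_mem, index_of, program_mem = [], {}, []
    for word in program:
        if word not in index_of:              # store each unique wide word once
            index_of[word] = len(decoder_mem)
            decoder_mem.append(word)
        program_mem.append(index_of[word])    # narrow (32-bit) entry per cycle
    return program_mem, decoder_mem

def fetch(pc, program_mem, decoder_mem):
    """One narrow off-chip read selects a full wide word on chip."""
    return decoder_mem[program_mem[pc]]

# A tight kernel repeats the same few wide words many times:
kernel = [("ld", "ld", "nop", "addr", "addr", "mac", "nop"),
          ("ld", "ld", "nop", "addr", "addr", "mac", "st")]
program = kernel * 500                        # 1000 wide words in program order
pmem, dmem = build_memories(program)
print(len(pmem), len(dmem))                   # 1000 2
```

Only two wide words need on-chip storage for the thousand-cycle trace, while the per-cycle fetch stays at one narrow word.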
8/29
UTDSP Packing Method
9/29
Denser Packing using Two Clusters
10/29
Slot Sharing: Achieve Ideal Packing
- Identical operations share one decoder slot
- Further reduces the decoder memory of two-cluster packing by 10%
- Achieves a better-than-ideal packing result
  - Ideal: no NOPs in decoder memory
- No extra hardware
- The result can be improved further using clever register allocation
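The slot-sharing idea above can be illustrated with a toy packer. This is an assumed model for illustration, not the exact UTDSP hardware: decoder storage is kept per functional-unit column, NOPs occupy no storage (the "ideal" bound), and an operation repeated in the same column is stored once and shared (which is how the result beats that bound).

```python
# Illustrative sketch of slot sharing: per-column decoder tables where
# identical operations are stored once, and NOP slots cost nothing.

def pack_with_sharing(wide_words):
    """wide_words: list of tuples, one op (or None for NOP) per slot.
    Returns (encoded words as index tuples, total slots stored)."""
    n_slots = len(wide_words[0])
    columns = [dict() for _ in range(n_slots)]   # op -> index, per column
    encoded = []
    for word in wide_words:
        idxs = []
        for s, op in enumerate(word):
            if op is None:                       # NOP: no decoder storage
                idxs.append(None)
                continue
            if op not in columns[s]:
                columns[s][op] = len(columns[s])
            idxs.append(columns[s][op])          # shared entry for repeats
        encoded.append(tuple(idxs))
    storage = sum(len(c) for c in columns)
    return encoded, storage

words = [("add r3,r2,r1", None, "ld r4"),
         ("add r3,r2,r1", "mul r5,r6,r7", "ld r4"),  # reuses two slots
         (None, "mul r5,r6,r7", None)]
encoded, storage = pack_with_sharing(words)
print(storage)   # 3 unique operations stored, vs 9 raw encoding slots
```

Here even the NOP-free "ideal" would store 6 operations; sharing stores only the 3 distinct ones.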
11/29
Implementation of UTDSP Packer Software System
12/29
Packing Performance: UTDSP vs. VelociTI
- Better packing rate than the VelociTI algorithm
- Reduces instruction bandwidth to 32 bits, so off-chip instruction memory can be used
- No crossbar needed
- Only 27% storage overhead to obtain the speedup of the VLIW design
  - Speedup of 6 in hand-crafted kernels, 2.6 on average
[Bar chart: average storage requirement over 12 compiler-generated benchmarks, normalized to a Uni-OP encoding (1.0); the UTDSP Packer requires about 1.27 and VelociTI about 1.3]
13/29
Architecture Simulator and GUI Assembly Debugger
- The architecture simulator evaluates design trade-offs at an early stage
- The GUI-based assembly debugger helps assembly coding
14/29
OO Model Using Java: Minimize the Design Gap
- An object can model any class of automata in the Chomsky hierarchy
  - Data members model the state memory
  - Class methods model the state transitions
- Build the corresponding digital hardware by 1-to-1 mapping
  - Translate the object partition into a digital component partition
  - Translate the message protocol into a signal interface
- The two models end up with identical partitions and block interfaces
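The mapping described above can be made concrete with a minimal sketch (in Python here for brevity; the thesis simulator itself is in Java, and this counter example is assumed, not taken from it): the object's field is the state memory, and one method applies the state-transition function, mirroring a register plus next-state logic in hardware.

```python
# Minimal sketch of modeling a finite automaton as an object: the field is
# the state register, the method is the clocked state transition, and the
# method's arguments/return value play the role of the signal interface.

class ModuloCounter:
    """Models a mod-N counter; maps 1-to-1 to a register + increment logic."""

    def __init__(self, n):
        self.n = n
        self.count = 0            # data member = state memory (the register)

    def clock(self, enable):      # class method = state transition
        if enable:                # argument = input signal (enable pin)
            self.count = (self.count + 1) % self.n
        return self.count         # return value = output port

c = ModuloCounter(4)
outs = [c.clock(True) for _ in range(5)]
print(outs)   # [1, 2, 3, 0, 1]
```

Because the object boundary equals the hardware block boundary, the same partition and interface carry over when the model is translated to RTL.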
15/29
GUI-based Assembly Debugger is Free
- Adding self-displaying and event-listening abilities to the objects in the UTDSP OO model means the debugger need not be modified when the architecture simulator changes
16/29
UTDSP Hardware Design
17/29
Combine EX/MEM to Reduce RAW Hazards and Bypassing
- Bypassing is a costly function in VLIW
  - Complexity of the comparators and bypass buses is O(d*n^2), with n EUs and d = number of stages between ID and WB
- Combining EX/MEM reduces RAW hazards and bypassing hardware by 50%
  - All remaining RAW hazards can be resolved by bypassing
- Trade-off: eliminate the displacement addressing mode so pipeline latency does not increase
  - MUs initiate memory access at the very beginning of the EX stage
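The O(d*n^2) claim above is easy to verify with a back-of-the-envelope count (the two source operands per unit are an assumption for illustration): every source operand of every unit must be compared against the destination of every unit in every in-flight stage.

```python
# Back-of-the-envelope count of bypass comparators, showing the O(d * n^2)
# growth and the 50% saving from merging EX and MEM into one stage.

def bypass_comparators(n_units, d_stages, src_operands=2):
    """Each source operand of each unit is compared against the result of
    every unit in every pipeline stage between ID and WB."""
    return n_units * src_operands * n_units * d_stages

print(bypass_comparators(6, 2))   # 144 comparators with separate EX and MEM
print(bypass_comparators(6, 1))   # 72 after combining EX/MEM: a 50% saving
```

Halving d halves the comparator (and bypass-bus) count, which is exactly the 50% hardware reduction the slide cites.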
18/29
UTDSP Instruction Set: 69 Instructions
- Memory instructions (MU1, MU2)
  - Address-indirect addressing mode
  - MU1 is associated with data memory X, MU2 with data memory Y
- Address instructions (AU1, AU2)
  - Modulo addressing mode
  - A start/end register implementation reduces the number of ports on the register files
- Integer instructions (DU1, DU2)
  - MAC instructions for integer and 1.15 fixed-point formats
- Control instructions (PCU)
  - Branch, Jump, JSR (2-cycle branch delay)
  - Zero-overhead hardware loops (nested up to 5 levels)
    - Branches and interrupts allowed in the inner loops
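The modulo addressing mode above can be sketched as follows. This is an illustrative model under assumptions (the wrap arithmetic and the start/end register pair as the buffer bounds), not the documented UTDSP AU behavior:

```python
# Sketch of modulo (circular-buffer) post-increment addressing using a
# start/end register pair to bound the buffer.

def modulo_post_increment(addr, step, start, end):
    """Return (access_addr, next_addr); next_addr wraps within [start, end]."""
    nxt = addr + step
    if nxt > end:                  # stepped past the end register: wrap down
        nxt = start + (nxt - end - 1)
    elif nxt < start:              # stepped below the start register: wrap up
        nxt = end - (start - nxt - 1)
    return addr, nxt

# Walk a 4-entry circular buffer at addresses 100..103 six times:
a, seq = 100, []
for _ in range(6):
    cur, a = modulo_post_increment(a, 1, 100, 103)
    seq.append(cur)
print(seq)   # [100, 101, 102, 103, 100, 101]
```

Keeping the bounds in dedicated start/end registers means the general register file never needs extra read ports for the wrap check, which is the port-count saving the slide mentions.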
19/29
Optimize Kernel Performance Using Hardware Loops
- Zero-overhead, nested up to 5 levels, interruptible
- Example: for (i = 0; i < 1000; i++) { r3 = r2 + r1; r5 = r3 + r4; }
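The zero-overhead property can be made concrete with a toy sequencer model (the loop-register model and two-operation body below are assumptions for illustration, not the UTDSP ISA): the hardware repeats the body the requested number of times without spending any cycles on a test-and-branch.

```python
# Toy model of zero-overhead hardware-loop sequencing: loop start, end, and
# count live in dedicated registers, so the wrap-back costs zero cycles.

def run_hw_loop(body, count):
    """Execute `body` `count` times; only body operations consume cycles."""
    regs = {"r1": 1, "r2": 2, "r3": 0, "r4": 4, "r5": 0}
    cycles = 0
    for _ in range(count):              # hardware repeat: no test/branch ops
        for op in body:
            op(regs)
            cycles += 1
    return regs, cycles

body = [lambda r: r.update(r3=r["r2"] + r["r1"]),   # r3 = r2 + r1
        lambda r: r.update(r5=r["r3"] + r["r4"])]   # r5 = r3 + r4
regs, cycles = run_hw_loop(body, 1000)
print(regs["r5"], cycles)   # 7 2000
```

A software loop would add at least one decrement/compare and one branch per iteration (plus the 2-cycle branch delay), so the same kernel costs 2000 cycles here instead of 4000 or more.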
20/29
Design Challenge: The PC Unit
- Handles not only RISC tasks but also nestable, interruptible, zero-overhead hardware loops
21/29
VLSI Implementation: Challenges in Design Capture and CAD Methodology
- Design capture
  - Pipelined design, modeled in 10,100 lines of RTL VHDL
  - SRAM macros from CMC used to build the memory blocks
- Need for a novel hierarchical CAD flow
  - Interconnect delays decrease at the 50,000-gate block level as feature size shrinks
    - Need logic grouping and merging to partition the chip and localize critical paths
  - Minimize top-level routing between blocks
    - Need global pin optimization
  - Increase the routing utilization rate
    - Need a hierarchical P&R flow and an area-based router
22/29
Hierarchical CAD Flow
23/29
Logic Grouping and Merging
- The logic hierarchy may be a bad partition for P&R when minimizing interconnect delays and chip area
- Components on the critical path are scattered across many blocks; grouping and merging them localizes the critical path
24/29
Minimize Top-level Routing and Chip Area
- Use connectivity analysis to decide the floorplan
- Many iterations are needed to make the design pad-limited
25/29
Global Pin Optimization
- Arrange the pins of each block to minimize interconnect
- Place and route each block according to its pin locations
26/29
Area Savings Using the Hierarchical CAD Flow
- Reduces chip area by 36% (19 mm², about $38K)
- 50% utilization rate on average vs. 20% in a flat flow
27/29
Post-Layout Speed Results
[Bar chart: post-layout critical-path delay broken into interconnect, gate, and MAC components, under best-case and worst-case conditions]
- Full-chip R/C extraction obtains the interconnect delays
  - The maximum critical-path interconnect delay is 92 ps => P&R is doing a good job
- Back-annotating the interconnect delays gives the actual speed
  - 63 MHz under best-case conditions
  - 29 MHz under worst-case conditions
  - 30%-50% of yield using TSMC 0.35 um cells can achieve best-case timing
- The MAC accounts for 70% of the critical-path gate delay
  - A faster implementation is needed to improve speed
28/29
UTDSP Facts
|                                         | TI TMS320C6201 | Philips R.E.A.L. DSP | UTDSP                 |
|-----------------------------------------|----------------|----------------------|-----------------------|
| Cycle count for FIR (N taps, M samples) | M(N+8)/2 + 6   | M(N+9)/2 + 8         | M(N+6)/2 + 7          |
| Max clock speed                         | 167-200 MHz    | 85 MHz               | 63 MHz                |
| Number of FUs                           | 8              | 10                   | 7                     |
| Process tech                            | 0.25 um CMOS   | 0.25 um CMOS         | 0.35 um CMOS          |
| Deliverable form                        | 352-pin BGA    | RTL core             | 108-pin PGA, RTL core |
| Design methodology                      | Full-custom    | Synthesis            | Synthesis             |
29/29
Conclusions and Future Work
- Conclusions:
  - Designed and implemented a two-cluster instruction packer that outperforms the VelociTI packing scheme
  - Designed and implemented an OO architecture simulator and a GUI-based assembly debugger
  - Defined a novel hierarchical CAD flow
  - Presented the VLSI implementation of the UTDSP
- Future work:
  - Build register files using SRAM compilers
  - Implement in a CMOS technology with more metal layers
  - Completely eliminate the design gap using JavaToRTL
30/29
Why We Need an Architecture Simulator
31/29
Speedup Comparison: UTDSP vs. TI VelociTI