VLIW DSP Processor Design for Mobile Communication Applications Contents crafted by Dr. Christian Panis Catena Radio Design

VLIW DSP Processor Design for Mobile Communication … · 2013. 1. 9. · Embedded processing of lower communication layers. Characteristica. Mix of traditional loop-centric DSP algorithms



Agenda

Trends in mobile communication

Architectural core features with significant impact on performance

Case study: 3a – a scalable VLIW architecture

Design space exploration

Challenges of scalability

Summary


Trends in Mobile Communication

[Figure; source: IEEE Spectrum, July 2004]


Trends in Mobile Communication

Embedded systems are increasingly emerging

Bandwidth demands lead to a significant increase in computational requirements

Trade-off: power dissipation vs. flexibility vs. performance

Cost pressure, feature size, application space

Multistandard solutions

Application-specific and customizable processors


How to Tackle the Problem?

Application-specific processor architectures:
provide support for application-specific requirements
provide domain-specific problem solutions
provide trade-offs: power vs. area, effort vs. flexibility

Domain-Specific Processor Architectures

Things to be considered:
A separate tooling/tool chain for each core?
How to analyse the application-specific requirements?
How to analyse the gain compared with a standard core solution?
How to deal with the additional verification effort caused by flexibility?


Focus / Why VLIW?

Focus: embedded processing of lower communication layers

Characteristics: a mix of traditional loop-centric DSP algorithms with control code

A load/store VLIW is one possible solution:
Real-time requirements for signal processing algorithms (+)
High ILP support allows efficient execution of inner loops (+)
Code density is a drawback of VLIW (-)
Poor cache support (+/-)
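To make the code density drawback concrete, the following Python sketch compares a classic VLIW encoding, where every bundle is padded with NOPs to the full issue width, against storing only the operations that are actually issued. The 16-bit operation size, the 4-slot issue width and the toy program are assumptions chosen for illustration, not 3a parameters.

```python
# Hypothetical parameters: neither value is taken from the 3a specification.
OP_BITS = 16      # assumed size of one encoded operation
ISSUE_WIDTH = 4   # assumed number of issue slots per VLIW bundle


def fixed_vliw_bits(bundles):
    """Classic VLIW: every bundle occupies all ISSUE_WIDTH slots; unused slots hold NOPs."""
    return len(bundles) * ISSUE_WIDTH * OP_BITS


def variable_bits(bundles):
    """Variable-length bundles: only the operations actually issued are stored."""
    return sum(len(b) for b in bundles) * OP_BITS


# A high-ILP inner loop body followed by control code with little ILP.
program = [
    ["ld", "ld", "mac", "mac"],
    ["ld", "ld", "mac", "mac"],
    ["cmp"],
    ["br"],
    ["add", "st"],
]

fixed = fixed_vliw_bits(program)
compact = variable_bits(program)
print(f"fixed: {fixed} bits, variable: {compact} bits ({compact / fixed:.0%})")
```

For this toy program the padded encoding needs 320 bits against 192 bits; the 40% difference is pure NOP padding, concentrated in the control code where little ILP is available.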


Architectural Key Characteristics

Register file: size, number of entries, structure

Data path(s): number, parallel availability, type

Memory ports / bandwidth: number, data width, supported granularity

ISA, binary coding: instruction mapping, binary coding, native instruction word size

Pipeline structure: number of stages, exceptions


Architectural Key Characteristics

Register file: size, number of entries, structure

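[Figure: two register file configurations. Left: D [0..31], L [0..15], A [0..15], with data registers drawn in pairs (d1/d0, d3/d2, …, d31/d30) next to a0/l0 … a15/l15. Right: D [0..15], L [0..15], A [0..15], with data register pairs d1/d0, d3/d2, … next to a0/l0 … a15/l15.]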


Architectural Key Characteristics

Data path(s): number, parallel availability, type of supported functions

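[Figure: two data path variants operating on op1 … op6. Left: SIMD MUL with two lanes, each a multiplier (res 1 = op1 × op2, res 2 = op4 × op5) followed by a combined adder/ALU (res 3 from res 1 and op3, res 4 from res 2 and op6). Right: SIMD MUL whose lanes use a plain adder after each multiplier (res 3, res 4), plus two separate ALUs (res 5 from op1, op2, op3; res 6 from op4, op5, op6).]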
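The multiplier-plus-adder lanes map directly onto the multiply-accumulate at the heart of FIR filters and correlations. The behavioural Python sketch below models a hypothetical two-lane SIMD MUL + adder data path computing a dot product, two MACs per modelled cycle. Plain integer arithmetic (no fixed-point rounding or saturation) and the lane count are simplifying assumptions.

```python
def dual_mac_dot(x, h):
    """Dot product on a hypothetical two-lane SIMD MUL + adder data path.

    Each loop iteration models one cycle in which both lanes issue a
    multiply-accumulate (lane result = op1 * op2 + op3, with op3 the running sum).
    Returns (result, mac_cycles); the final add of the two lane accumulators is
    not counted as a MAC cycle.
    """
    if len(x) != len(h):
        raise ValueError("x and h must have the same length")
    acc0 = acc1 = 0
    mac_cycles = 0
    for i in range(0, len(x) - 1, 2):
        acc0 = x[i] * h[i] + acc0              # lane 0
        acc1 = x[i + 1] * h[i + 1] + acc1      # lane 1
        mac_cycles += 1
    if len(x) % 2:                             # odd length: lane 0 works alone
        acc0 = x[-1] * h[-1] + acc0
        mac_cycles += 1
    return acc0 + acc1, mac_cycles


print(dual_mac_dot([1, 2, 3, 4], [5, 6, 7, 8]))  # (70, 2)
```

With two lanes an N-tap dot product needs about N/2 MAC cycles; a data path with separate ALUs (right-hand variant) can additionally run other arithmetic in parallel, at the cost of more hardware.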


Architectural Key Characteristics

Memory ports / bandwidth: number, data width, supported granularity

[Figure: two memory interface configurations, each with two address generation units (AGU 1, AGU 2) and two memory ports (PORT 1, PORT 2).]
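Two AGUs feeding two ports let an inner loop fetch, for example, a sample and a coefficient in the same cycle. The Python sketch below assumes simple post-modify addressing (access at address register r, then r += modifier m), loosely mirroring the address and modifier registers (r0, m0) of the 3a register file; the `AGU` class and its behaviour are hypothetical, and real AGUs typically offer more modes (e.g. modulo addressing) not shown here.

```python
class AGU:
    """Hypothetical address generation unit with post-modify addressing:
    the access uses address register r, then r is updated by modifier m."""

    def __init__(self, r, m):
        self.r = r  # address register
        self.m = m  # modifier register

    def next_address(self):
        addr = self.r
        self.r += self.m  # address register update after the access
        return addr


def dual_port_loads(mem, agu1, agu2, n):
    """Each iteration models one cycle: AGU 1 drives PORT 1 and AGU 2 drives
    PORT 2, so two operands are loaded in parallel."""
    return [(mem[agu1.next_address()], mem[agu2.next_address()]) for _ in range(n)]


mem = list(range(100, 120))   # toy data memory
x_agu = AGU(r=0, m=1)         # e.g. samples, stride 1
h_agu = AGU(r=10, m=2)        # e.g. every second coefficient, stride 2
print(dual_port_loads(mem, x_agu, h_agu, 3))  # [(100, 110), (101, 112), (102, 114)]
```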


Architectural Key Characteristics

ISA, binary coding: instruction mapping, binary coding, native instruction word size

[Figure: two-operand ISA (add op1, op2; sub op1, op2) vs. three-operand ISA (add op1, op2, op3; sub op1, op2, op3), mapped onto 16-bit and 20-bit native instruction words, with a comparison of the number of instructions, long instructions and code size in bytes.]
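A short calculation shows how register file size and operand count compete for bits in the native instruction word. The Python sketch counts only register specifier bits; opcode, predication and any other fields are ignored, so it shows one plausible pressure toward wider words rather than the actual 3a encoding.

```python
def specifier_bits(num_registers):
    """Bits needed to encode one register index 0 .. num_registers-1."""
    return (num_registers - 1).bit_length()


def operand_field_bits(num_registers, num_operands):
    """Bits consumed by the register operands of one instruction."""
    return num_operands * specifier_bits(num_registers)


for regs in (16, 32):            # D [0..15] vs. D [0..31]
    for ops in (2, 3):           # add op1, op2  vs.  add op1, op2, op3
        print(f"{regs} registers, {ops} operands: {operand_field_bits(regs, ops)} bits")
```

A three-operand instruction over 32 data registers already needs 15 bits for register specifiers, leaving a single bit for everything else in a 16-bit word; with 16 registers it needs 12 bits. This is why a larger register file or a three-operand format tends to push toward longer (e.g. 20-bit) instruction words.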


Architectural Key Characteristics

Pipeline structure: number of stages, exceptions

[Figure: two pipeline variants. Left: five stages (Instruction Fetch, Alignment, Instruction Decode, Execute 1, Execute 2), annotated with read op1, read op2, write back, and for load/store operations address calculation and address register update. Right: six stages (Instruction Fetch, Alignment, Instruction Decode, Execute 1, Execute 2, Execute 3), annotated with read op1, read op2a, read op2b, accu write back 1, write back 2, address calculation 1 and address calculation 2.]
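The cost of an extra stage can be read off an ideal pipeline diagram. The Python sketch below builds the occupancy table for an in-order pipeline without stalls; the stage abbreviations are ours, and hazards, stalls and exceptions are deliberately not modelled.

```python
FIVE_STAGE = ["IF", "AL", "ID", "EX1", "EX2"]   # fetch, alignment, decode, execute 1-2
SIX_STAGE = FIVE_STAGE + ["EX3"]


def pipeline_table(stages, n):
    """Ideal in-order pipeline without stalls: instruction i enters stage s in
    cycle i + s. Returns (total_cycles, rows) where rows[i][c] is the stage
    occupied by instruction i in cycle c ("" if none)."""
    total = n + len(stages) - 1
    rows = []
    for i in range(n):
        row = [""] * total
        for s, name in enumerate(stages):
            row[i + s] = name
        rows.append(row)
    return total, rows


total, rows = pipeline_table(FIVE_STAGE, 3)
for i, row in enumerate(rows):
    print(f"i{i}: " + " ".join(f"{cell:>3}" for cell in row))
print("cycles:", total)
```

In the ideal case both variants retire one instruction per cycle; the extra Execute 3 stage adds one cycle of latency and, in general, makes every pipeline flush (taken branch, exception) more expensive.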


3a: case study

Key architectural aspects:
Modified dual Harvard load/store architecture
RISC instruction set
xLIW (scalable long instruction word)
Orthogonal register file and ISA
Destination-register-based predicated execution
Instruction buffer for power-efficient inner loop processing
Scalable and configurable core architecture
Architecture specified with an optimizing C compiler in mind
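The power benefit of an instruction buffer comes from avoiding program memory fetches inside inner loops. The Python sketch is a deliberately simple model that assumes one program memory fetch per bundle and that a loop body fitting into the buffer is fetched only once; the buffer size and fill policy of 3a are not given on these slides.

```python
def program_memory_fetches(body_bundles, iterations, buffer_capacity):
    """Hypothetical model of an instruction (loop) buffer: if the loop body fits,
    only the first iteration is fetched from program memory and the remaining
    iterations are served from the buffer. Otherwise every iteration fetches."""
    if body_bundles <= buffer_capacity:
        return body_bundles
    return body_bundles * iterations


without = program_memory_fetches(8, 64, buffer_capacity=0)
with_buf = program_memory_fetches(8, 64, buffer_capacity=16)
print(without, with_buf)   # 512 8
```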


3a: case study

C-compiler aspects

Compiler-friendly: load/store architecture, orthogonal ISA, large uniform register files, functionality stored in the ISA, simple issue rules

Compiler-unfriendly: mode-dependent instructions, irregular instructions, implicit dependencies, modes for instruction sets


3a: case study

xLIW – scalable long instruction word

[Figure: instructions inst0 … inst n are stored contiguously in program memory; an align unit extracts variable-length execution bundles and dispatches them to the decoder ports (LD/ST, LD/ST, CMP, CMP, PSEQ). Example: inst0, inst1 in cycle m; inst2 in cycle m+1; inst3, inst4 in cycle m+2; inst5 in cycle m+3; inst6, inst7 in cycle m+4.]
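The Python sketch below models the align unit idea: bundles are stored densely without NOP padding, program memory is read in fixed-width fetch lines, and the align unit reassembles bundles even when they straddle a line boundary. The per-instruction end-of-bundle flag and the fetch width of four instructions are assumptions for illustration; the slides do not state how 3a delimits bundles.

```python
from typing import List, Tuple

# (mnemonic, end_of_bundle). The end-of-bundle marker is an assumption made for
# this sketch; the slides do not specify how 3a delimits bundles.
Instr = Tuple[str, bool]


def pack(bundles: List[List[str]]) -> List[Instr]:
    """Store bundles densely in program memory, without NOP padding."""
    stream: List[Instr] = []
    for bundle in bundles:
        for i, op in enumerate(bundle):
            stream.append((op, i == len(bundle) - 1))
    return stream


def align(stream: List[Instr], fetch_width: int = 4) -> Tuple[List[List[str]], int]:
    """Model of an align unit: read fixed-width fetch lines from program memory
    and cut the instruction stream into execution bundles at end-of-bundle marks.
    Bundles may straddle fetch-line boundaries. Returns (bundles, lines_fetched)."""
    lines = [stream[i:i + fetch_width] for i in range(0, len(stream), fetch_width)]
    bundles: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        for op, end_of_bundle in line:
            current.append(op)
            if end_of_bundle:
                bundles.append(current)
                current = []
    return bundles, len(lines)


# Bundle sequence from the figure: cycles m .. m+4.
program = [["inst0", "inst1"], ["inst2"], ["inst3", "inst4"], ["inst5"], ["inst6", "inst7"]]
bundles, fetched = align(pack(program))
print(bundles == program, fetched)   # True 2
```

Here the bundle inst3, inst4 spans two fetch lines, yet the five bundles occupy only two lines; storing them as fixed 4-slot words would take five.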


3a: case study

Destination-register-based predicated execution

[Figure: two load/store units and two arithmetic units, with predicated execution controlled by the flag register file.]
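Predication replaces short branches with guarded operations, which keeps inner loops branch-free. The Python sketch shows only the general idea of a flag guarding the register write; how 3a associates flags with destination registers is not detailed on these slides, and the register and flag names are hypothetical.

```python
def predicated_write(regs, flags, dst, value, pred=None):
    """Commit value to regs[dst] only if the guarding flag is set.
    pred=None means the instruction is unconditional. The operation itself may
    still execute; only the register write is suppressed."""
    if pred is None or flags[pred]:
        regs[dst] = value


def abs_diff(a, b):
    """Branch-free |a - b|: one compare sets two complementary flags, then two
    predicated subtractions target the same destination; exactly one commits,
    so both could be issued in the same execution bundle."""
    regs = {"d0": a, "d1": b, "d2": 0}
    flags = {"f0": regs["d0"] > regs["d1"]}
    flags["f1"] = not flags["f0"]
    predicated_write(regs, flags, "d2", regs["d0"] - regs["d1"], pred="f0")
    predicated_write(regs, flags, "d2", regs["d1"] - regs["d0"], pred="f1")
    return regs["d2"]


print(abs_diff(7, 3), abs_diff(3, 7))   # 4 4
```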


3a: case study

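Orthogonal register file, including the flag register file

[Figure: data, address and flag register files. Data register file: data registers (d0, d1), long register (l0) and accumulator register (a0), drawn over the fields gb, d1, d0. Address register file: address register (r0) and modifier register (m0).]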


3a: case study

"3-phase" pipeline

[Figure: pipeline stages Instruction Fetch, Alignment, Instruction Decode, Execute 1, Execute 2, grouped into phase 1 (fetch), phase 2 (decode) and phase 3 (execute).]


3a: case study

Things to be considered:
A separate tooling/tool chain for each core?
How to analyse the application-specific requirements?
How to deal with the additional verification effort caused by flexibility?
How to analyse the gain compared with a standard core solution?

Scalable core architecture:
Adaptable to application-specific requirements
Scalable in performance

"one core" – "one tool chain"


Design Space Exploration – Design Flow

[Figure: design flow in two phases. Evaluation phase, for each candidate (configuration 1, configuration 2, …, configuration n): compiler generator, optimizing C compiler, assembler/linker, ISS, hardware generators, functional testing, testcase generator and documentation generator, applied to the application C code and producing static analysis results, dynamic analysis results and a verification report. Production phase, for the chosen core configuration: the same tool set plus a bincode generator, producing the binary executable together with static and dynamic analysis results and a verification report.]
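At the end of the evaluation phase, the per-configuration analysis results have to be turned into a choice. The Python sketch shows one simple selection rule, the smallest-area configuration that meets a cycle budget and a code size budget; both the rule and all numbers are invented for illustration, and a real exploration would also weigh power and other constraints.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConfigResult:
    name: str
    cycles: int        # from dynamic analysis (ISS run of the application)
    code_bytes: int    # from static analysis (code size)
    area: float        # estimated core area, arbitrary units


def choose_configuration(results: List[ConfigResult], cycle_budget: int,
                         code_budget: int) -> Optional[ConfigResult]:
    """Pick the smallest-area configuration that meets both budgets."""
    feasible = [r for r in results
                if r.cycles <= cycle_budget and r.code_bytes <= code_budget]
    return min(feasible, key=lambda r: r.area) if feasible else None


# Invented numbers for illustration only.
results = [
    ConfigResult("configuration 1", cycles=120_000, code_bytes=24_000, area=1.0),
    ConfigResult("configuration 2", cycles=80_000, code_bytes=28_000, area=1.4),
    ConfigResult("configuration n", cycles=60_000, code_bytes=30_000, area=2.1),
]
best = choose_configuration(results, cycle_budget=100_000, code_budget=32_000)
print(best.name if best else "no feasible configuration")   # configuration 2
```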


Design Space Exploration – Static Analysis

Code size: measures how efficiently the application can be mapped onto a processor in terms of required code space

Parallelism: measures how efficiently the application code can be mapped onto a parallel architecture

Instruction histogram: measures how frequently each instruction is used when the application code is mapped onto the chosen ISA
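All three static metrics can be computed from the compiled program alone, without running it. The Python sketch assumes the program is available as a list of execution bundles of mnemonics and that every instruction takes 2 bytes (16-bit encoding); both the input format and the size are assumptions.

```python
from collections import Counter


def static_analysis(bundles, bytes_per_op=2):
    """Static metrics over a compiled program given as a list of execution bundles.
    bytes_per_op=2 assumes 16-bit instructions."""
    ops = [op for bundle in bundles for op in bundle]
    return {
        # code size
        "code_bytes": len(ops) * bytes_per_op,
        # parallelism: average number of instructions per execution bundle
        "ops_per_bundle": len(ops) / len(bundles) if bundles else 0.0,
        # instruction histogram
        "histogram": Counter(ops),
    }


program = [["ld", "ld", "mac", "mac"], ["cmp"], ["br"], ["add", "st"]]
stats = static_analysis(program)
print(stats["code_bytes"], stats["ops_per_bundle"], stats["histogram"].most_common(2))
```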


Design Space Exploration – Dynamic Analysis

Program memory fetch: measures how efficiently program memory fetches are used; mainly influenced by pipelined processors and by application code with short branch distances and high branch frequency

Execution count per bundle: measures how often a given execution bundle is executed

Execution count per instruction: measures how frequently a given instruction is executed
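The execution counts come from running the application, for example on the ISS. The Python sketch assumes the simulator logs the address of every executed bundle; execution counts per bundle then fall out of counting the trace, and counts per instruction follow by weighting each bundle's instructions. The trace format is an assumption.

```python
from collections import Counter


def dynamic_analysis(program, trace):
    """program: dict mapping bundle address -> list of mnemonics in that bundle.
    trace: bundle addresses in execution order, e.g. as logged by an ISS.
    Returns (execution count per bundle, execution count per instruction)."""
    per_bundle = Counter(trace)
    per_instruction = Counter()
    for addr, count in per_bundle.items():
        for op in program[addr]:
            per_instruction[op] += count
    return per_bundle, per_instruction


program = {0: ["ld", "ld", "mac", "mac"], 1: ["cmp"], 2: ["br"], 3: ["st"]}
trace = [0, 1, 2] * 3 + [3]          # loop body executed three times, then exit
per_bundle, per_instruction = dynamic_analysis(program, trace)
print(dict(per_bundle), dict(per_instruction))
```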


Design Space Exploration – Example


Design Space Exploration – Statistics


3a: case study

Things to be considered:
A separate tooling/tool chain for each core?
How to analyse the application-specific requirements?
How to analyse the gain compared with a standard core solution?
How to deal with the additional verification effort caused by flexibility?

Application code analysis:
Detailed static and dynamic analysis
Quantitative analysis of different core-based platforms
Balancing different core features against area/power consumption
Quantitative support for optimizing HW/SW partitioning
Identifying "hot spots"


3a: case study

Benchmarking?

Do the application requirements fit standard benchmarks?

Application benchmarking: benchmarking of the target architecture


Design Space Exploration

Things to be considered:
A separate tooling/tool chain for each core?
How to analyse the application-specific requirements?
How to analyse the gain compared with a standard core solution?
How to deal with the additional verification effort caused by flexibility?

Analysis of application code on "function" level

Comparison of MIPS/memory requirements for different application setups and for different core architectures

Benchmarking for the target architecture
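Function-level analysis turns per-function cycle counts into a processing requirement and a hot-spot ranking. The Python sketch assumes cycle counts per processing frame are available for each function (for example from the ISS) and approximates MIPS as million cycles per second, i.e. one bundle per cycle; function names and numbers are invented.

```python
def function_profile(cycles_per_frame, frames_per_second):
    """cycles_per_frame: dict function name -> cycles per processing frame
    (e.g. measured on the ISS). Assuming one instruction bundle per cycle, cycles
    per second divided by 1e6 approximates the MIPS requirement.
    Returns (total MIPS, [(function, MIPS, share of total)] sorted as hot spots)."""
    total_cycles = sum(cycles_per_frame.values())
    total_mips = total_cycles * frames_per_second / 1e6
    ranking = sorted(
        ((name, cycles * frames_per_second / 1e6, cycles / total_cycles)
         for name, cycles in cycles_per_frame.items()),
        key=lambda entry: entry[1],
        reverse=True,
    )
    return total_mips, ranking


# Invented per-frame cycle counts for three hypothetical functions.
profile = {"equalizer": 15_000, "channel_decoder": 30_000, "control": 5_000}
total, hot_spots = function_profile(profile, frames_per_second=100)
print(f"total: {total:.1f} MIPS")
for name, mips, share in hot_spots:
    print(f"{name:16s} {mips:5.1f} MIPS {share:6.1%}")
```

Running the same profile against several core configurations gives the MIPS comparison per application setup; the top entries of the ranking are the candidates for ISA extensions or hardware acceleration.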


Challenges of scalability

Verification effort versus Flexibility

XML based configuration file

Binary code generator

Documentation generator/adaptation

HW code generator

Testcase generator
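The point of a single XML-based configuration file is that every generator (hardware, binary code, documentation, testcases) derives its parameters from one source, which keeps the verification effort manageable as flexibility grows. The Python sketch parses a hypothetical configuration with the standard `xml.etree.ElementTree` module; the element and attribute names are invented and are not the actual 3a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration format; element and attribute names are invented
# for this sketch and are not the actual 3a configuration schema.
CONFIG_XML = """
<core name="configuration 1">
  <registerfile type="data" entries="32"/>
  <registerfile type="long" entries="16"/>
  <registerfile type="address" entries="16"/>
  <datapath type="mac" count="2"/>
  <datapath type="alu" count="2"/>
  <memoryport count="2" width="32"/>
</core>
"""


def load_core_config(text):
    """Parse the configuration into a plain dict that every generator can consume."""
    root = ET.fromstring(text.strip())
    port = root.find("memoryport")
    return {
        "name": root.get("name"),
        "registers": {rf.get("type"): int(rf.get("entries"))
                      for rf in root.findall("registerfile")},
        "datapaths": {dp.get("type"): int(dp.get("count"))
                      for dp in root.findall("datapath")},
        "memory_ports": int(port.get("count")),
        "port_width": int(port.get("width")),
    }


def data_register_field_bits(cfg):
    """A derived parameter the HW, binary code and testcase generators must agree on."""
    return (cfg["registers"]["data"] - 1).bit_length()


cfg = load_core_config(CONFIG_XML)
print(cfg["name"], cfg["datapaths"], data_register_field_bits(cfg))
```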


Summary

Application-specific processors make it possible to meet area and power dissipation requirements in SoCs for mobile communication platforms

Multistandard requirements lead to domain-specific processor architectures

"one core" – "one tool chain"

Design space exploration is required to analyse domain-specific requirements on the core subsystem