Compilers and System Software for Embedded DSP Processors

Ministry of Economic Affairs (經濟部), Technology Development Program for Academia (學界開發產業技術計畫)

Full project period: February 1, 2005 to January 31, 2008 (ROC 94/02/01 to 97/01/31)

National Tsing Hua University, Integrated Circuit Design Technology R&D Center

National Chiao Tung University, Electronics and Information Research Center

Prof. Jenq-Kuen Lee (李政崑)


Embedded System Software Business Opportunities

[Figure: Semiconductor Design Pyramid, 2002 (left) and 2005 (right). User tiers: Power Users (Seats), Upper Mainstream, Lower Mainstream, Late Adopter. The first pyramid has four activity levels: "Marketing"; "Marketing, Design"; "Marketing, Design, Library Control, EDA Tools"; "Marketing, Design, Library Control, EDA Tools and Silicon". The second pyramid adds a fifth level: "Marketing, Design, Library Control, EDA Tools and Silicon and Software". Seat shares printed in the first panel: 9.4% (10,737), 74.7% (85,726), 8.4% (9,693), 7.5% (8,644). Seat shares printed in the second panel: 9.8% (18,918), 45.0% (86,868), 31.0% (59,843), 14.2% (27,412).]

Source: Forecast: Embedded Software Tools, Worldwide 2004-2009, Gartner (Aug. 2005)

Why will software be a concern to hardware people?

Case Study 1: Low-Power Decisions Are Made Very Early in the Design Flow

[Figure: power reduction percentage by design stage, running from Design through Implementation to Production.]

• System Design: algorithms, IP, S/W, etc.
• Architecture: multi-voltage islands, sleep mode, etc.
• RTL Synthesis: clock gating, μArch, multi-Vt
• Place & Route: clock-tree, gate-level

Source: Cadence Encounter Low-Power Design Flow Seminar and Workshop

Case Study 2: Audio DSP Addressing

I registers
• 14-bit addressing
• Contain the actual address used to access memory

M registers
• 14-bit addressing
• Used for the post-modify scheme

L registers
• 14-bit addressing
• Provide the modulus logic with the length information

Memory addressing space
• Before: 14-bit (16K) addressing
• After: 24-bit (16M) addressing
• I, M, L registers: NO changes (all remain 14 bits wide)
• Two "page" registers are applied for addressing: DMPAGE (10-bit) and PMPAGE (10-bit)

Even with the DMPAGE and PMPAGE registers providing the 10 extra address bits, we DO NOT know when a carry out of the 14-bit part happens.

[Figure: 24-bit address layout. DMPAGE occupies bits 23 to 14; the I, M, L register value occupies bits 13 to 0.]

Better Solution for Memory Extension

Better solution:
1. Hardware automatically completes the carry calculation for the "page" registers.
2. No "page" registers. Instead, extend the I, M, L registers to 24 bits, or design new 24-bit registers I', M', L'.
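
The carry problem can be made concrete with a small C model. This is only an illustrative sketch: the `addr_gen` structure and both `post_modify_*` functions are hypothetical names, and the model assumes the effective address is simply the 10-bit page concatenated with the 14-bit index, as the address layout on the slide suggests.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 14u
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1u)   /* 0x3FFF */
#define PAGE_MASK   ((1u << 10) - 1u)            /* 10-bit page register */

/* Hypothetical model: a 10-bit page register (like DMPAGE) plus a
 * 14-bit index register (like an I register). */
typedef struct {
    uint32_t page;   /* upper 10 bits of the 24-bit address */
    uint32_t index;  /* lower 14 bits of the 24-bit address */
} addr_gen;

/* Form the 24-bit effective address from page and index. */
static uint32_t effective_address(const addr_gen *g)
{
    return ((g->page & PAGE_MASK) << OFFSET_BITS) | (g->index & OFFSET_MASK);
}

/* Post-modify with unchanged 14-bit registers: the carry out of
 * bit 13 is silently dropped, so software cannot see the page crossing. */
static void post_modify_no_carry(addr_gen *g, uint32_t m)
{
    assert(m <= OFFSET_MASK);
    g->index = (g->index + m) & OFFSET_MASK;
}

/* Post-modify in the style of "better solution 1": the carry is
 * propagated into the page register by the (modelled) hardware. */
static void post_modify_with_carry(addr_gen *g, uint32_t m)
{
    assert(m <= OFFSET_MASK);
    uint32_t sum = (g->index & OFFSET_MASK) + m;
    g->index = sum & OFFSET_MASK;
    g->page = (g->page + (sum >> OFFSET_BITS)) & PAGE_MASK;
}
```

Starting from index 0x3FFE in page 0 and post-modifying by 4, the first variant wraps back to address 0x0002, while the second moves on to address 0x4002 in page 1.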

Case Study 3: Saturation Instruction

Per-instruction saturation flag:
• Instruction 1, Saturation-Yes-No
• Instruction 2, Saturation-Yes-No
• Instruction 3, Saturation-Yes-No
• Instruction 4, Saturation-Yes-No

Saturation as a mode around a group of instructions:
• Saturation-on
• Instruction 1
• Instruction 2
• Instruction 3
• Instruction 4
• Saturation-off

The same mode with surrounding code, where Code-2 ends up inside the saturation region:
• Code-1
• Saturation-on
• Instruction 1
• Instruction 2
• Instruction 3
• Code-2
• Instruction 4
• Saturation-off

Compiler Intrinsic Function

C-level functions that stand for several assembly instructions and are expanded into compiler intermediate instructions:

L_SHR, L_SHL, L_SUB, L_ADD, L_MSU, L_MAC, ROUND, L_MULT, SATURE

[Figure: the sequence "Code-1, Saturation-on, Instruction 1, Instruction 2, Instruction 3, Instruction 4, Saturation-off, Code-2" is replaced by "Code-1, Sature, Code-2".]
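
To illustrate what such an intrinsic hides, here is a minimal C sketch of saturating arithmetic. The function names `sat_add32`, `sat16`, and `dot_sat` are hypothetical; they only mirror the spirit of L_ADD and SATURE, and the accumulation uses a plain integer product rather than the fractional semantics of L_MULT or L_MAC.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit saturating add, in the spirit of L_ADD: the result is clamped
 * to the int32_t range instead of wrapping around. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
}

/* Clamp a 32-bit value to the signed 16-bit range, in the spirit of SATURE. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Dot product with a saturating accumulator. A compiler with intrinsic
 * support could map each sat_add32 call to a single saturating instruction. */
static int32_t dot_sat(const int16_t *x, const int16_t *y, int n)
{
    assert(n >= 0);
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = sat_add32(acc, (int32_t)x[i] * (int32_t)y[i]);
    return acc;
}
```

Because the saturation lives inside the function call, the compiler sees one operation and does not have to keep unrelated code out of a saturation-on/saturation-off region.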

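ESW Key Technology Items: Platform Technology and Embedded Software

Innovative application systems and software
• (Digital life) Digital home application software: entertainment, information, control, and service applications.
• Personal portable consumer electronics: communication, entertainment, and information applications on mobile devices.
• Embedded systems, not restricted to domestic or foreign ones.

[Figure: applications AP1 to AP7 sit on top of Embedded Software (Compiler, RTOS), which sits on the TW Platform (TW MPUs; DSP, Multi-DSP).]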

Star IP Program

Perspective: To coordinate resources from industry, universities, and research organizations in the proactive development of architecture and integrated software, building up Taiwan's industrial leadership in silicon IP.

Basic concepts:
• CPU/DSP/MPU
• Multimedia IPs
• Networking
• Interconnection/Bus/Memories core

Innovative IPs for multi-band and digitalization are required for future wireless/mobile products.

StarIP Programs (☆):
• Key Techniques in Chip Systems
• Transmission Links in Chip Systems
• Ultra Low Power (ULP) DSP Core
• Messenger: Distributed Radio Transmitter System
• Low-Power Dual-Processor System

Processor cores: A-Core, Sunplus S-Core, STC PAC

ESW System and Application Reference Platform

[Figure: layered reference platform.]

• Applications: Office, PDF; Media Player; Image Viewer; Security SW; DRM; GPS Nav.; Mail; Web Browser; VOIP, IM; Java AP, Java 3D; P2P Stream Server; Web Service; Web Container; Multimedia Portal; Digital Camera
• Middleware: GUI, 3D, JSR-184/239; DLNA; JVM, OSGi; DirectFB, OpenGL ES; Network protocol, SIP, RTP; DSP Middleware; Streaming Models
• System Kernel Software: Embedded Linux + RTOS; Multi-Core Micro Kernel; Multi-Core IPC; Power Mgmt
• APIs and tools: Multi-Core IDE and Debugger; Multi-Core Compiler Toolkits; ESL: Virtual Platform Design
• Multi-core SoC Platform: Multi-Core, MPU + VLIW DSP; Multi-Core interconnection; DMA / Memory Control; Multimedia Accelerator (Graphics, AV CODECs, Baseband, Transcoding, Biometrics, Crypto Engine); Wireless (GSM, 3G, WiMAX); Ethernet, 1394; USB, HDMI

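Project Structure

R&D Project on Advanced High-Performance, Low-Power Dual-Processor System Technology (weight: 100%)

A. System software R&D and design (Prof. 李政崑, NTHU Computer Science)
B. Parallel signal processor chip development and circuit design (Prof. 黃威, NCTU Electronics)
C. Operating system and application R&D (Prof. 林大衛, NCTU Electronics)

(Manpower, budget) Year 1: 29.20%, 33.26%; Year 2: 29.20%, 33.26%; Year 3: 29.20%, 33.26%
(Manpower, budget) Year 1: 37.37%, 33.17%; Year 2: 37.37%, 33.17%; Year 3: 37.37%, 33.17%

System software sub-items:
1. Design and development of a VLIW DSP compiler (Prof. 李政崑, NTHU Computer Science; Prof. 黃元欣, NTOU Computer Science)
2. DSP library development and system performance evaluation (Prof. 石維寬, NTHU Computer Science; Prof. 李政崑, NTHU Computer Science)
4. System planning for dual-processor development tool integration and testing (Prof. 李政崑, NTHU Computer Science)

Chip and circuit sub-items:
1. Extensible and Energy-Aware Micro-architecture (Prof. 劉志尉, NCTU Electronics)
2. Low Power Logic and Memory Units (Prof. 黃威, NCTU Electronics)
3. Interface Circuit Design and Bit-Coprocessing Accelerator (Prof. 張添烜, NCTU Electronics)

Operating system and application sub-items:
1. Object-oriented execution environment (Prof. 陳俊穎, NCTU Computer and Information Science)
2. System analysis and design of multimedia algorithms (Prof. 林大衛, NCTU Electronics; Prof. 賴尚宏, NTHU Computer Science)

Also listed:
3. Development and planning of the dual-processor system demonstration platform (subcontracted to ITRI STC)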

Full-Project Output Schedule

[Figure: timeline of project outputs in quarterly steps from ROC 94/2 to 96/11 (2005/02 to 2007/11), grouped by track.]

• Compiler Toolchain: Assembler/linker (2.1); Assembler/linker (3.0); Debugger (2.0); Debugger (3.0); Compiler (3.0); Compiler Optimizations; SIMD/sub-word/GRA; GRA/SIMD Compilers; NewLib; NewLib on PMP SoC; Toolkits for CIC; ESL Platform
• Microkernel: Microkernel (2.0); Microkernel (3.0); Dual-core Programming model; Streaming Programming model
• Architecture: PACDSP 2.0; PACDSP 3.0; Pica 2.0; Pica 3.0; Multicore DSP Interface; Multi-threaded DSP Interface
• Low-Power: Multi-mode On-demand SRAM; Asynchronous FIFO in GALS; Embedded DRAM; Multi-mode SRAM for Virtual Cluster DSP; Controllable SRAM for Virtual Cluster DSP; GALS-based FFT; Multiple Supply Voltage System; Multi-Level Power Management; Low Power TCAM (0.26 fJ/bit/search); Controllable Dual-Port Low Power SRAM; DFVM for Energy-Aware FFT
• Java Environment: KVM on PAC; JVM with JIT compiler on PAC; CVM/KVM/MIDP 2.0 Benchmarking; AOT Compilation Prototype
• Video Coding: Frame-based MPEG4 decoder; Frame-based MPEG4 encoder; Object-based MPEG4 decoder; Object-based MPEG4 encoder; Simple ME; Fast Motion Estimation; Efficient Intra-Prediction; Fast Intra-Prediction; Intra-Prediction on PAC; Fast Inter-mode Decision (static learning)

Tool Chain: Compiler & Micro-Kernel

[Figure: milestone timeline across the PAC 2.0, PAC 2.1, PAC 3.0, and PAC 3.1 generations.]

• 2005.9: PAC DSP 2.1 Document Available
• 2005.9: PAC DSP 2.0 Debugger Available
• 2005.10: PAC DSP 2.0 Microkernel Available
• 2005.11: PAC DSP 2.1 Assembler/Linker Available
• 2005.12: PAC DSP 3.0 Document Available
• 2006.2: PAC DSP 3.0 ISS Available
• 2006.2: PAC DSP 3.0 Assembler/Linker Available
• 2006.3: PAC DSP 2.0 Real Chip Demo
• 2006.4: PAC DSP 3.0 Compiler Available
• 2006.5: PAC DSP 3.0 RTL Freeze / Tape-out
• 2006.5: PAC DSP 3.0 Software/Hardware Testing Criteria Passed
• 2006.10: PAC DSP 3.0 Real Chip Testing Passed
• 2007.1: PAC DSP 3.0/3.1 Optimizing Compiler Available
• 2007.1: PAC PMP SoC Tape-out; ARM+DSP H.264 VGA Realtime Demo Passed
• 2007.3: Enable PAC DSP System Newlib
• 2007.4: SIMD with Sub-word Instructions / Global RFA
• 2007.6: Programming Model Design for PAC II

Tool Chain: Compiler & Micro-Kernel (Continued, 2008)

[Figure: milestone timeline for the PAC 3.1 generation.]

• 2007.6: Streaming RPC Programming Model Design for PAC II
• 2007.8: Communication APIs
• 2007.9: Pass EEMBC Telecom Test-suite
• 2007.10: Pass Major MiBench Test-suite
• 2007.10: PAC PMP SoC Board Available
• 2007.11: PAC DSP Debugger on PAC PMP SoC Board
• 2008.01: PAC DSP Toolkits Delivered for CIC
• 2008.02: Enable PAC DSP newlib support on PAC PMP SoC Board
• 2008.03: Multi-core communication API runtime
• 2008.3: PACDSP + AndeSCore Dual-Core ESL platform


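Project Output: Integrated Multimedia Dual-Core Application Platform

[Figure: software stack of the platform.]

• Application: GUI, SIP Phone, Image Viewer, Media Player, Java AP, Web Browser
• Middleware: AV Codecs, Java VM, OpenMAX IL, OpenMAX DL, DSP Lib., RPC, Streaming RPC
• Dual-Core OS/Architecture/ESL: Linux, pCore, ESL: Virtual Platform (MPU + PACDSP)
• Development Software: Compiler Toolkits

What each component provides:
• SIP Phone: VoIP telephony over the network
• AV Codecs: PACDSP H.264 and MPEG4 codecs
• Web Browser: web browsing application
• Image Viewer: photo browsing
• Media Player: video playback
• RPC: calls and data transfer between the two cores
• DSP Lib.: NewLib as the C library
• Compiler Toolkits: a complete software development environment (IDE, Compiler, Debugger, Assembler, Binutils, …)
• Linux: a Linux 2.6 operating system environment
• Hardware: dual-core high-performance hardware platform (A3, with technical support from STC)
• ESL: system-level simulation environment (output of a derived project)
• OpenMAX: multimedia calls at the OpenMAX IL layer and library design at the OpenMAX DL layer
• Java VM: runs Java applications
• Streaming RPC: streaming-style RPC data transfer between the two cores
• pCore (DSP microkernel): multiple function calls and task switching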

PACDSP Kernel (from ITRI STC)

Clustered Architecture
• Scalable clustered architecture: scalable computation power fits different application requirements.
• Distributed register file: distributed register files reduce area and power consumption.

Cluster (from ITRI STC)

Distributed Register File: A, PP, C, and AC register files
• Ping-pong register file: reduces move operations; reduces the number of ports in each register file.
• Constant register file: reduces load/store operations; reduces the power consumed by coefficient accesses.

Clustered Data Path and Distributed Register File: Evaluation Result (from ITRI STC)

Compared with a centralized register file:
• Area: 76.8% of the area is saved
• Access time: 46.9% of the access time is saved

DSP Compilers

• Heterogeneous Registers
• Addressing Modes
• Limited Connectivities
• Local Memories
• Harvard Architecture
• Sub-word/SIMD Performances

Key Compiler Technologies
• VLIW DSP compilers for distributed register files
– PALF scheduling policies for ILP (CPC 2006)
– SA-based scheme for ILP (LCPC 2005)
– GRA scheme for distributed register files (CPC 2007)
– Copy propagations for distributed register architectures (LCPC 2006)
– SWP flow for distributed register file architectures (ACM LCTES 2007)
• SIMD compiler optimizations
• Enable compiler + intrinsics/extrinsics
• Compiler for low-power schemes (ACM EMSOFT 2005 / ACM TODAES 2007)

Irregular Register File Usage

• Local RF (A registers for the M-Unit, AC registers for the I-Unit, R registers for the B-Unit): accessible only by the dedicated FU
• Global RF (D registers): shared by an M-I pair; access limited by ping-pong switches
• Constant RF (C registers): writable only by the M-Unit; 1 read-only access per unit; only usable by some instructions
• Each register file has at most 2 read ports + 1 write port: low cost and low power (compared to TI C6: 10 read ports + 6 write ports)

[Figure: two clusters, each with an M-Unit (A registers) and an I-Unit (AC registers) sharing D registers and C registers, plus a B-Unit with R registers.]

1. Register Allocation for PAC DSP

How to adapt register allocation to the features of irregular register files?
• Various RFs for each FU
• Ping-pong bank switches
• Complex register file communication

Our proposed solutions:
• SA-RA: a simulated-annealing approach (LCPC 2005)
– Random-based, nondeterministic iterative search
• PALF-RA: a more direct and faster heuristic approach
– PALF Register Allocation Scheme (CPC 2006)

Solution 1: Simulated-Annealing (SA) Based Register Allocation Approach

Motivation: complex interference among:
• Register allocation
• Instruction scheduling
• Code insertion for distributed register communication

We want a machine-learning style method that gives near-optimal results, to serve as a base reference for developing a faster heuristic method.

To determine: the mapping from each virtual register to a register file (bank)
To optimize: the scheduling result

Input: un-scheduled instructions
Output: a schedule of the instructions and a register file assignment (RFA) map

RFA map = {(v1, f1), (v2, f2), ...}, where vi is a virtual register and fi is a register file (bank)

PAC_Scheduler:
• Graph-coloring based register allocation according to the RFA map
• Instruction scheduling and code insertion for register file communication

Setup SA:
1. An initial random RFA map
2. sched_len = PAC_Scheduler(initial RFA map)
3. SA control variables:
• threshold
• p_test: a probability test value (0 < p_test < 1)
• energy: initial value > threshold

[Flowchart: one SA iteration]
• Randomly change a mapping (vi, fi) to obtain a new RFA map.
• Re-run: new_schedule_len = PAC_Scheduler(new RFA map).
• Better result test: new_schedule_len < schedule_len. If yes: energy--, schedule_len = new_schedule_len.
• Random test: a random number > p_test. If yes: energy++.
• Depending on the yes/no outcomes, either the new RFA map or the old RFA map is carried forward.
• SA stop test: energy > threshold. The loop ends with the final RFA map and schedule.
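
The search loop can be sketched in C as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_schedule_len` is a made-up stand-in for PAC_Scheduler, the loop is assumed to continue while energy > threshold, an iteration cap is added to guarantee termination, and the best map seen so far is tracked separately. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_VREGS     8
#define NUM_BANKS     3
#define BANK_CAPACITY 4

/* Small deterministic generator so the sketch is reproducible. */
static uint32_t rng_state = 12345u;
static uint32_t next_random(void)
{
    rng_state = rng_state * 1664525u + 1013904223u;
    return rng_state >> 8;
}

/* Made-up cost model standing in for PAC_Scheduler: one extra cycle
 * for each bank crossing between neighbouring virtual registers, and
 * two extra cycles per register above a bank's capacity. */
static int toy_schedule_len(const int rfa[NUM_VREGS])
{
    int count[NUM_BANKS] = {0};
    int len = NUM_VREGS;
    for (int v = 0; v < NUM_VREGS; v++) {
        assert(rfa[v] >= 0 && rfa[v] < NUM_BANKS);
        count[rfa[v]]++;
        if (v > 0 && rfa[v] != rfa[v - 1]) len++;
    }
    for (int b = 0; b < NUM_BANKS; b++)
        if (count[b] > BANK_CAPACITY) len += 2 * (count[b] - BANK_CAPACITY);
    return len;
}

/* Randomized search over RFA maps. On return, rfa holds the best map
 * found and the function result is its schedule length. */
static int sa_search(int rfa[NUM_VREGS], int energy, int threshold,
                     uint32_t p_test_percent, int max_iters)
{
    int best[NUM_VREGS];
    int cur_len = toy_schedule_len(rfa);
    int best_len = cur_len;
    assert(energy > threshold && p_test_percent < 100u && max_iters >= 0);
    memcpy(best, rfa, sizeof best);

    for (int it = 0; it < max_iters && energy > threshold; it++) {
        int v = (int)(next_random() % NUM_VREGS);
        int old_bank = rfa[v];
        rfa[v] = (int)(next_random() % NUM_BANKS);  /* change one mapping */
        int new_len = toy_schedule_len(rfa);        /* re-run the scheduler */

        if (new_len < cur_len) {                    /* better result test */
            energy--;
            cur_len = new_len;
            if (new_len < best_len) {
                best_len = new_len;
                memcpy(best, rfa, sizeof best);
            }
        } else if (next_random() % 100u > p_test_percent) { /* random test */
            energy++;
            cur_len = new_len;                      /* keep the new map */
        } else {
            rfa[v] = old_bank;                      /* restore the old map */
        }
    }
    memcpy(rfa, best, sizeof best);
    return best_len;
}
```

Because the best map is tracked separately, the returned length is never worse than that of the initial map, even when the random test accepts a worse neighbour along the way.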

Solution 2: Concepts for PALF Register Allocation Scheme

Which RF to allocate?
• The candidates are Local RF, Constant RF, and Global RF, compared by usability ("less usability"), scheduling interference ("more scheduling interference"), and allocation priority ("higher priority").
• If the results are not worse than using the global RF, why not use the local RF instead?

How to utilize ping-pong banks?
• To maximize parallelism between banks. This is hard to determine without scheduling.
• Only allocate Global RF for: [figure: M-Unit, I-Unit]

How to utilize clusters?
• Partition operations into two clusters if it is worth it.

PALF Register Allocation Scheme

Ping-pong Aware Local Favorable (PALF):
• Allocate local RF first
• Assign ping-pong banks to minimize interference

PALF determines the RF allocation, then applies the usual register allocation to each RF.

[Flowchart stages: Build CRTA-DDG; Maximal Localization; Register File Assignment; Ping-pong Register Bank Assignment; decision "2-Cluster Code?" (Yes/No); Cluster Assignment; Communication Code Insertion; Post-pass Register Allocation.]
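
The "local first" idea can be illustrated with a very small decision function. This is a hedged sketch only: the `vreg_info` fields, the `choose_rf` function, and the exact rule (local when a value is touched by a single unit, constant RF for read-only coefficients, global RF otherwise) are assumptions made for illustration, not the published PALF algorithm.

```c
#include <assert.h>

#define UNIT_M 1u   /* value is defined or used by the M-Unit */
#define UNIT_I 2u   /* value is defined or used by the I-Unit */

typedef enum { RF_LOCAL, RF_CONSTANT, RF_GLOBAL } rf_class;

/* Hypothetical summary of one virtual register. */
typedef struct {
    unsigned user_units;  /* bitmask of units touching the value */
    int is_coefficient;   /* nonzero for a read-only coefficient */
} vreg_info;

/* Local-favorable choice: prefer the local RF whenever only one unit
 * touches the value, and fall back to the global RF last. */
static rf_class choose_rf(const vreg_info *v)
{
    assert(v->user_units != 0u);
    if ((v->user_units & (v->user_units - 1u)) == 0u)  /* exactly one unit */
        return RF_LOCAL;
    if (v->is_coefficient)
        return RF_CONSTANT;
    return RF_GLOBAL;
}
```

In this simplified view, only values that really travel between the M-Unit and the I-Unit end up in the global RF, which keeps pressure off the ping-pong banks.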

Experiment Platform

Environment
• PACDSP Compiler (using the ORC infrastructure)
• PACDSP binutils (modified GNU binutils)
• Instruction Set Simulator (cycle accurate)

Benchmark
• DSPstone

[Figure: System-Software Development Suite: C/C++ Compiler, Assembler, Linker, Libraries, Instruction Set Simulator, Debugger, Profiler.]

DSPstone Testing Patterns: Speedup


2. Enable GRA (Global Register File Assignment) for PAC Architectures

• Extends local (single block) RFA to global RFA
• Block prioritization for global RFA
– Based on loop depth, scheduling length, frequency (static or profiled), etc.
– Prioritizing more "important" blocks that have greater effects on performance
– Try to minimize modification to the RFAs of the blocks with higher priority; that is, keep them as untouched as possible
• A Local-Conscious Global Register Allocator for VLIW DSP Processors with Distributed Register Files, Chia Han Lu, Young-Jia Lin, Yi-Ping You, Jenq-Kuen Lee, CPC 2007 (Compilers for Parallel Computing 2007), Lisbon, Portugal, July 9-11, 2007
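
Block prioritization can be sketched as a sort over per-block weights. The weight formula below (an estimated frequency of 10 per loop level, multiplied by the schedule length) is an assumption chosen for illustration; the slide only names the inputs, and `block_info`, `block_weight`, and `prioritize_blocks` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int id;          /* basic block number, e.g. 3 for BB3 */
    int loop_depth;  /* nesting depth of the enclosing loops */
    int sched_len;   /* schedule length of the block in cycles */
} block_info;

/* Assumed static estimate: each loop level multiplies the execution
 * frequency by 10; the weight is frequency times schedule length. */
static long block_weight(const block_info *b)
{
    long freq = 1;
    assert(b->loop_depth >= 0 && b->loop_depth <= 6);
    for (int d = 0; d < b->loop_depth; d++)
        freq *= 10;
    return freq * b->sched_len;
}

/* qsort comparator: heavier blocks first, ties broken by block id. */
static int by_weight_desc(const void *pa, const void *pb)
{
    const block_info *a = pa;
    const block_info *b = pb;
    long wa = block_weight(a);
    long wb = block_weight(b);
    if (wa != wb)
        return (wa < wb) - (wa > wb);
    return (a->id > b->id) - (a->id < b->id);
}

/* Order blocks so that the most important ones get their RFA fixed first. */
static void prioritize_blocks(block_info *blocks, size_t n)
{
    qsort(blocks, n, sizeof blocks[0], by_weight_desc);
}
```

A global RFA pass could then visit the blocks in this order and avoid touching the assignments of blocks that were processed earlier.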


Example of Global RFA

[Figure: control-flow graph with basic blocks BB1 to BB8.]


Experimental Result of Global RFA


PACDSP Sub-word Instructions

• PAC DSP sub-word instructions
– Three execution modes
• Single: ADD Rd, Rs1, Rs2 (one 32-bit)
• Dual: ADD.D Rd, Rs1, Rs2 (two 16-bit)
• Quad: ADD.Q Rd, Rs1, Rs2 (four 8-bit)
– Sub-word calculation (3 modes)

[Figure: Single mode adds two 32-bit operands into one 32-bit result; Dual mode adds two pairs of 16-bit lanes independently; Quad mode adds four pairs of 8-bit lanes independently.]
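
The lane behaviour of the dual and quad modes can be emulated in portable C. This sketch assumes wrap-around (non-saturating) lanes; whether ADD.D and ADD.Q wrap or saturate is not stated on the slide, and `add_dual16` and `add_quad8` are hypothetical helper names.

```c
#include <stdint.h>

/* Dual mode: two independent 16-bit lanes in one 32-bit word.
 * No carry crosses from the low lane into the high lane. */
static uint32_t add_dual16(uint32_t a, uint32_t b)
{
    uint32_t low  = (a + b) & 0x0000FFFFu;
    uint32_t high = ((a >> 16) + (b >> 16)) << 16;
    return high | low;
}

/* Quad mode: four independent 8-bit lanes in one 32-bit word.
 * The low 7 bits of every lane are added first, so carries stay
 * inside their lane; the top bit of each lane is then fixed up
 * with an exclusive-or. */
static uint32_t add_quad8(uint32_t a, uint32_t b)
{
    const uint32_t top = 0x80808080u;
    uint32_t low7 = (a & ~top) + (b & ~top);
    return low7 ^ ((a ^ b) & top);
}
```

For example, a plain 32-bit add of 0x0001FFFF and 0x00010001 gives 0x00030000, whereas the dual-lane add gives 0x00020000 because the carry out of the low lane is discarded.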


Performance for SIMD on PAC Compiler

[Figure: Improvement for each optimization level. Speedup (axis 0 to 2.5) on bkfir, lms, ssfir, vecadd, vecdot; series O0+IL, O0+SIMD, O1+IL, O1+SIMD, O2+IL, O2+SIMD.]

[Figure: Overall performance. Speedup (axis 0 to 18) on bkfir, lms, ssfir, vecadd, vecdot; series O0, O1+IL, O2+IL, O2+SIMD.]

PS. IL means IPA and LNO optimization.

DSPstone performance


Compiler Performance with G.723.1a Application

[Figure: G.723.1a Decoder. Cycles normalized to TI O3 (axis 0 to 12) for TI O0, TI O1, TI O2, TI O3, PACC O0, PACC O1, PACC O2, PACC O2+IPA+WOPT, PACC with Extrinsic.]

[Figure: G.723.1a Encoder. Cycles normalized to TI O3 (axis 0 to 10) for TI O0, TI O1, TI O2, TI O3, PACC O0, PACC O1, PACC O2, PACC O2+IPA+WOPT, PACC with Extrinsic.]


Compiler with SIMD Intrinsic/Extrinsic Support

Legend: Register pair (64-bit); Fix-point; Saturated; Intrinsic supported

• Extrinsic: SATURE, L_MULT, ROUND, L_MAC, L_MSU, L_ADD, L_SUB, L_SHL, L_SHR
• Special: (D)CLR, CLS, LMBD, SFRA.(D), BF.(D), RND, SAA.Q
• Multiply & Add: FMAC(uu/us/su).(D), MAC.D, MSU.D, XFMAC.(D), XMAC.D, XMSU.D, DOTP2, XDOTP2
• Multiplication: FMUL(uu/su/us).(D), MUL.D, XFMUL.(D), XMUL.D
• Bit Manipulation: SLL.(D), SLLI.(D), SRL.(D), SRLI.(D), SRA.(D), SRAI.(D), EXTRACT(U), EXTRACTI(U), INSERT, INSERTI, ROL, ROR
• Arithmetic: ADD(U).(D/Q), ADDI.(D), SUB(U).(D/Q), MERGEA, MERGES, NEG, ABS.(D/Q), ADDC(U)
• Comparison: (D)MIN(U).(D/Q), (D)MAX(U).(D/Q)
• Data Transfer: MOVI.L, MOVI(U), MOVI(U).H, COPY, LIMW(U)CP, LIMHW(U)CP, LIMB(U)CP, PERMH2, PERMH4, SWAP2, UNPACK2(U), PACK4, UNPACK4(U), SWAP4, SWAP4E


MiBench Performance for PAC Compiler


EEMBC Performance


Compiler for Low-Power with Power Gating Instruction

• To turn off useless components in processors.
• Use compiler analysis techniques to analyze program behaviors.
• Compilers insert power-gating instructions and try to merge those instructions.

[Figure: Code size growth (axis 0% to 35%) for CADFE and CADFE w/ Sink-N-Hoist on complex_multiply, complex_update, convolution, dot_product, fir2dim, fir, iir_biquad_N_sections, iir_biquad_one_section, lms, matrix 1x3, matrix, n_complex_updates, n_real_updates, real_update, and the average.]

[Equations: data-flow equations over basic blocks $b$ for the sets COMPONENT, SINKABLE, HOISTABLE, GROUP-OFF, and GROUP-ON. The recoverable structure is

$X_{out}(b) = X_{loc}(b) \cup \bigl(X_{in}(b) - X_{blk}(b)\bigr)$,

where $X_{in}(b)$ is a union or an intersection of $X_{out}(p)$ over $p \in Pred(b)$ (the HOISTABLE equations range over $s \in Succ(b)$). The GROUP-OFF and GROUP-ON equations additionally take a MIN over $p \in Pred(b)$ of a quantity $\Phi$ and yield $\emptyset$ when $\Phi = \infty$.]

Based on our work in ACM TODAES 2006 & ACM TODAES 2007
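
Equations of the form out = loc ∪ (in − blk) are typically solved by iterating to a fixed point. The following C sketch shows a generic forward solver over bit sets. It is an assumption-laden illustration: it uses intersection at join points and a tiny hand-built CFG format, and it does not model the MIN/Φ part of the GROUP-OFF and GROUP-ON equations. `dfa_block` and `solve_forward` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCKS 8
#define MAX_PREDS  4

typedef struct {
    int      num_preds;
    int      preds[MAX_PREDS];  /* indices of predecessor blocks */
    uint32_t loc;               /* facts generated in the block (X_loc) */
    uint32_t blk;               /* facts blocked by the block (X_blk) */
} dfa_block;

/* Solve  in(b)  = intersection of out(p) over predecessors p
 *        out(b) = loc(b) | (in(b) & ~blk(b))
 * Blocks without predecessors start from the empty set. */
static void solve_forward(const dfa_block *cfg, int n,
                          uint32_t *in, uint32_t *out)
{
    assert(n > 0 && n <= MAX_BLOCKS);
    for (int b = 0; b < n; b++) {
        in[b] = 0u;
        out[b] = ~0u;           /* optimistic start for an intersection */
    }
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 0; b < n; b++) {
            uint32_t new_in = cfg[b].num_preds ? ~0u : 0u;
            for (int k = 0; k < cfg[b].num_preds; k++)
                new_in &= out[cfg[b].preds[k]];
            uint32_t new_out = cfg[b].loc | (new_in & ~cfg[b].blk);
            if (new_in != in[b] || new_out != out[b]) {
                in[b] = new_in;
                out[b] = new_out;
                changed = 1;
            }
        }
    }
}
```

On a diamond-shaped CFG where one branch blocks a fact, the fact does not reach the join block, because the intersection requires it on every incoming path.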


Dual-Core/Multi-Core Compiler with Language Support

• Full control of dedicated processors (say, PAC DSP)
– Boot & initialize DSP
– Based on pCore, multi-tasking environment
– Highly efficient data exchange between ARM and DSP
• Three-layer framework
– Layer 0: Communication library (APIs for C programs; conventional C library interfaces)
– Layer 1: Remoting/RPC (minor C-extension language; end-to-end service)
– Layer 2: Data flow control type

• Software Architecture Design for Streaming Java RMI, C. C. Yang, Jenq-Kuen Lee, et al., accepted, Science of Computer Programming.

• Streaming Support for Java RMI in Distributed Environment, C. C. Yang, Jenq-Kuen Lee, et al., ACM Principles and Practices of Programming In Java (PPPJ 2006), 2006.

• Efficient Switching Supports of Distributed .NET Remoting with Network Processors, Jenq-Kuen Lee, et al., ICPP 2005.

• Support and optimization of Java RMI over Bluetooth environments, P. C. Wey, Jenq-Kuen Lee, et al., Concurrency and Computation: Practice and Experience, 2005;17:967-989, Wiley, 2005.

• Efficient Support of Java RMI over Heterogeneous Wireless Networks, Jenq-Kuen Lee, et al., ICC, Paris, June 2004.

• Specification and Architecture Supports for Component Adaptations on Distributed Environments, Chung-Kai Chen, Jenq-Kuen Lee, et al., IPDPS 2004.


RPC

• Model the communication as end-to-end services
– Server: the remote process
– Client: the task that invokes a remote process
• Remote invocation is like a local function invocation
• Client and server communicate through a communicator
• The communication information is stored in a shared, resident data structure called the registry
• Services are off-loaded to DSPs by service processors

[Figure: client and server stacks. Application; Communication runtime (remote services invocation, streaming services, registry); Operating systems (pCore); Architecture-supported communication mechanisms (VIC, Mailbox, Shared Memory, DMA).]

Dual-Core Programming with Streaming RPC

Key components
• Streaming line: associated with an RPC request; provides a communication channel between the sender and the receiver
• Streaming buffer: associated with a streaming line to provide data buffering
• Stream controller: monitors and manages the streaming buffers to support data-driven and optimization features

[Figure: client and server stacks. Application; remote services invocation and streaming services (streaming line, streaming buffer, stream controller); Operating systems (pCore); Architecture-supported communication mechanisms (VIC, Mailbox, Shared Memory, DMA).]

Demo Environment: PAC EVB

ARM Versatile Platform PB926
• Platform: 64MB Intel NOR Flash; 128MB 32-bit SDRAM; Ethernet; VGA monitor output; 2 AHB-m, 1 AHB-s bridge
• ARM 926EJ-S: 300 MHz; 0.18μ; 32KB cache
• Software: Linux 2.6.17; IPC kernel module

EVB (PAC DSP)
• Platform: 512MB SDRAM; 1MB ARAM ×2, 1MB block RAM ×2; audio out, video out
• PAC DSP 3.0: 250 MHz; 64KB data RAM; 32KB instruction cache
• Software: pCore v3.0; IPC library

[Figure: photos of the ARM Versatile board and the EVB.]

pCore: PACDSP™ Microkernel

Kernel features
• Static memory allocation: pCore supports a configurable kernel
• Task management: at most 16 tasks; priority-based scheduler; constant-time scheduler
• Inter-process communication: at most 4 mailboxes
• Synchronization: at most 4 semaphores
• Dual-core programming: pCore provides APIs that help the programmer communicate with the MPU (ARM)
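
One common way to get a constant-time, priority-based pick with at most 16 tasks is a 16-bit ready bitmap. The sketch below illustrates that general technique and is not pCore's actual code: it assumes one task per priority level with 0 as the highest priority, and `sched_t` and the `sched_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TASKS 16   /* matches the "at most 16 tasks" limit */

/* Bit p is set when the task with priority p is ready (0 = highest). */
typedef struct {
    uint16_t ready;
} sched_t;

static void sched_set_ready(sched_t *s, int prio)
{
    assert(prio >= 0 && prio < MAX_TASKS);
    s->ready |= (uint16_t)(1u << prio);
}

static void sched_clear_ready(sched_t *s, int prio)
{
    assert(prio >= 0 && prio < MAX_TASKS);
    s->ready &= (uint16_t)~(1u << prio);
}

/* Return the highest-priority ready task, or -1 when nothing is
 * ready. The work is bounded by four table lookups regardless of
 * how many tasks are ready. */
static int sched_pick(const sched_t *s)
{
    static const int8_t lowest_bit[16] = {
        -1, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
    };
    for (int nib = 0; nib < 4; nib++) {
        unsigned bits = ((unsigned)s->ready >> (4 * nib)) & 0xFu;
        if (bits)
            return 4 * nib + lowest_bit[bits];
    }
    return -1;
}
```

With a media task at priority 9 and a phone task at priority 3, marking the phone task ready makes `sched_pick` return 3; clearing it again returns the scheduler to the media task.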

2008/06/24

Kernel service cycle counts

Kernel service                             No MCSS   With MCSS*
Task create (no context switching)         147       -
Task create (with context switching)       321       280
Task delete                                113       -
Task suspend (no context switching)        91        -
Task suspend (with context switching)      254       205
Task change priority                       249       198
Semaphore post (no context switching)      56        -
Semaphore post (with context switching)    270       220
Semaphore pend (no context switching)      56        -
Semaphore pend (with context switching)    270       220
Mailbox post (no context switching)        54        -
Mailbox post (with context switching)      268       218
Mailbox pend (no context switching)        70        -
Mailbox pend (with context switching)      284       234
Pipe initialization                        21        -
Pipe read                                  41        -
Pipe write                                 41        -

* K.-Y. Hsieh, Y.-C. Lin, and J. K. Lee, "Enhancing microkernel performance on VLIW DSP processors via multiset context switch," Journal of VLSI Signal Processing Systems, 2007.


Applications with Streaming RPC

Features
• Support RPC-level abstraction
• Design patterns for overlapping communication and computation
• Threshold parameter for reducing the amount of hand-shaking
• Support Linux kernel 2.6.x
• Support two platforms: PAC-EVB and TI OMAP 5912

Example: Sample code of the streaming RPC implementation of an MP3 decoder

/* Streaming RPC client */
void mp3_decoder(void)
{
    stream_rpc(imdct, transmitter);
}

void transmitter(void)
{
    STREAM_ID id = 4;
    /* Initializing streaming channel */
    stream_create(id);
    /* Pushing data to streaming channel */
    stream_put(id, DATA);
    stream_push(id);
    …
}

/* Streaming RPC server */
void imdct(void)
{
    STREAM_ID id = 4;
    /* Initializing streaming channel */
    stream_create(id);
    /* Aggregating from streaming channel */
    stream_get(id, DATA);
    stream_pop(id);
    …
}
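
How a threshold parameter can cut down on hand-shaking is easiest to see in a small buffer model. The sketch below is a self-contained illustration with hypothetical names (`stream_buf`, `sbuf_init`, `sbuf_put`, `sbuf_get`); it is not the implementation behind the `stream_*` calls above. It assumes the sender notifies the receiver only after `threshold` new items have been written, and counts the notifications instead of raising a real interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SBUF_CAPACITY 16

typedef struct {
    int16_t  data[SBUF_CAPACITY];
    size_t   head;           /* index of the oldest item */
    size_t   count;          /* number of buffered items */
    size_t   threshold;      /* items per notification */
    size_t   unannounced;    /* items written since the last notification */
    unsigned notifications;  /* simulated handshakes with the receiver */
} stream_buf;

static void sbuf_init(stream_buf *b, size_t threshold)
{
    assert(threshold >= 1 && threshold <= SBUF_CAPACITY);
    b->head = 0;
    b->count = 0;
    b->threshold = threshold;
    b->unannounced = 0;
    b->notifications = 0;
}

/* Append one item; returns 0 on success or -1 when the buffer is full. */
static int sbuf_put(stream_buf *b, int16_t value)
{
    if (b->count == SBUF_CAPACITY)
        return -1;
    b->data[(b->head + b->count) % SBUF_CAPACITY] = value;
    b->count++;
    if (++b->unannounced >= b->threshold) {
        b->notifications++;   /* one handshake covers a whole batch */
        b->unannounced = 0;
    }
    return 0;
}

/* Remove the oldest item; returns 0 on success or -1 when empty. */
static int sbuf_get(stream_buf *b, int16_t *value)
{
    if (b->count == 0)
        return -1;
    *value = b->data[b->head];
    b->head = (b->head + 1) % SBUF_CAPACITY;
    b->count--;
    return 0;
}
```

Writing eight items with a threshold of 4 triggers two notifications, compared with eight when the threshold is 1, at the cost of the receiver seeing data slightly later.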


Performance Evaluation

Applications
• JPEG decoder, resolution of 317*255
• MP3 decoder, 128k bit-rate, 44100 sample rate
• QCIF H.264 decoder

Application kernels
• IDCT of JPEG decoder
• IMDCT of MP3 decoder
• IT/IQ of H.264 decoder

Streaming RPC improves the average performance of the decoders by 30%.


Dual-Core Toolkits Integration Demo: Dual-core Multimedia Mobile Phone

• Three multimedia decoders: JPEG, MP3, H.264
• One simulated telecom decoding program (SIP)
• Dual-OS environment with communication library
• Multiple tasks on the DSP, scheduled by pCore with a priority-based policy
• While the user is making a phone call:
– PAC DSP starts processing telecom decoding with the highest priority
– The media decoding task is suspended, and it is resumed at the end of the phone call

[Figure: execution timeline. ARM: media thread, phone thread. DSP: media task, phone task, idle task. Events: program start, initialize tasks, start media, make a phone call, media continues.]

pllab.cs.nthu.edu.tw

Ministry of Economic Affairs, Technology Development Program for Academia

Virtual Platform for Dual-Core Embedded Processor Architecture Technology: A Design Experience on AndESLive

Principal investigator: Prof. 李政崑, National Tsing Hua University, Integrated Circuit Design Technology R&D Center

PACDSP + Andes Toolchain: Binutils, Compiler, GDB, Libraries

Andes + PAC: Dual-Core ESL Simulation Environment

Completed work
• ESL simulation: completed the PACDSP v3.0 sid IP; integrated the PACDSP IP with the AndESLive IP; shared-memory and interrupt communication
• Development tools: PACDSP Compiler, PACDSP Binutils

Work in progress
• Operating system: dual-core OS with the Linux 2.6 kernel (Andes) and PACDSP pCore (NTHU)

AndesSCore™ + PACDSP Co-Simulation

Dual Core: JPEG Demo

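Awards

Columns: date and award | award-winning title | student (where listed) | advisor

• 2008/7, Ministry of Education Embedded Software Contest, winner | PACDSP Compiler | Prof. 李政崑
• 2006, Ministry of Education SIP Contest, open-topic group, honorable mention | The third place award for the Soft-IP Un-assign Topic in The Silicon Intellectual Property (SIP) Design Contest | Prof. 劉志尉
• 2006/1, ASPDAC Outstanding Design Award | 52mW 1200 MIPS Compact DSP for Multi-core Media SoC | Prof. 劉志尉
• 2006/11, Workshop on Consumer Electronics and Signal Processing, student paper award | Software implementation of MPEG-4 video decoder on the PACDSP platform | Prof. 林大衛
• 2006/12, 中華民國資訊學會 Best Master's and Doctoral Thesis Awards, master's thesis honorable mention | A set-associative cache memory that reduces leakage current and improves performance | 李嘉哲 | Prof. 黃元欣
• 2006/12, 中華民國資訊學會 Best Master's and Doctoral Thesis Awards, master's thesis honorable mention | Data streaming support for remote invocation of software components on overlay networks | 楊智傑 | Prof. 李政崑
• 2007/7, Macronix Golden Silicon Award (旺宏金矽獎), winner | Integrated development tools for dual-core systems | Prof. 李政崑
• 2007/7, Ministry of Education Embedded Software Contest, winner | Design and verification of the PACDSP BIOS | Prof. 李政崑
• 2007/11, National Symposium on Telecommunications, student paper excellence award | Software implementation of MPEG-4 object-based video decoder on a dual-core digital signal processor platform | Prof. 林大衛
• 2007/12/22, Microcomputer Application System Design and Implementation Contest, embedded software systems category, graduate division, third place | Fast Motion Estimation in H.264 for PAC DSP | Prof. 賴尚宏
• 2008/1, 中華民國資訊學會 Best Master's and Doctoral Thesis Awards, doctoral thesis excellence award | Compiler optimizations for low-power embedded processors | 游逸平 | Prof. 李政崑
• 2008/1, 中華民國資訊學會 Best Master's and Doctoral Thesis Awards, master's thesis excellence award | Microkernel design for digital signal processors and a dual-core development environment | 黃建今 | Prof. 李政崑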


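CIC Distribution: PACDSP Processor Development Tool Suite

Provided jointly with STC to the National Chip Implementation Center (CIC): the PACDSP processor development tool suite

Contract date: ROC 97.03.10 (March 10, 2008)
Contract amount: 2,000,000 (15% income for NTHU MOEA Team)

Items
• DSP tool chain on PMP EVB
• C-library for PMP EVB
• Integrated development environment
• PMP Spec

Course outline (Training course / LAB)
• Software environment manual
• Peripheral hardware user manual
• DSP programming
• PMP EVB overview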


SunPlus SPCT6100

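ESW Platform Development Mechanism

• Platform (技術處, Department of Industrial Technology): build the core technology of our own platform and software on Taiwan's own processor cores
• Education (教育部, Ministry of Education): cultivate application and R&D talent for our own platform and software
• Development (工業局, Industrial Development Bureau): promote diverse applications through the commercialization of the platform and software
• Research (國科會, National Science Council): extend the platform's diverse applications through open software, and develop forward-looking technology for higher performance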

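National Science Council Academic R&D and Application Program for Free Software and Embedded Systems

Program promotion strategy
• Emphasize R&D quality management: software quality management process workshops; document review in three stages (requirements analysis, system design, system testing); public review of results
• Strengthen the participation and cooperation of research institutes, associations, and industry: Golden Penguin Award (黃金企鵝獎)
• Strengthen exchanges between academic teams and the free software community
• Reward the promotion of results and outstanding teams in diverse ways
• Provide training courses, technical support, and consulting
• Encourage the use of embedded platforms for development, research, and system development in embedded system contests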

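Promoting the Common Platform Development Industry Alliance (TEIA)

Purpose
• Promote the common platform to industry, academia, and research organizations
• Hold developer technology-sharing seminars from time to time
• Provide a website where platform developers can exchange ideas

Steps
• Invite the common platform vendors to draft the alliance charter
• Broadly invite domestic embedded software and hardware vendors to join

Scale
• Responsible for alliance affairs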

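Promoting the Common Platform Developer Industry Chain (building an eco-system)

Segments: system integration and development; embedded software; embedded hardware

Sector | well-known domestic vendors | well-known foreign vendors
• IC design | 威盛, 凌陽, 瑞昱, 聯發科, etc. | Qualcomm, Broadcom
• IC design services | 創意, 智原 | Wipro
• SIP vendors | 晶心 | ARM, MIPS
• Embedded OS/software | 智崴, 系微, 滾雷 | Microsoft, Wind River, VenturCom, Quadros, QNX Software Sys, Palm
• Design services | 環隆電氣, 新華 | Flextronics, Solectron
• Industrial control | 威達電, 研華科技 | Kontron
• Networking | 合勤, 零壹, 友訊 | 3Com, Cisco
• Medical | Agilent | Omron
• Consumer electronics | MSI, Gigabyte | Sony, Apple
• PC/PC peripherals | Acer, Asus | HP, Dell
• Mobile phones | 宏達電, 啟基 | Nokia, Motorola
• Automotive | 裕隆, 中華 | Toyota, Benz, BMW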

Summary

• System software will play a significant role in Taiwan's IC industry.
• Dual-core and multi-core solutions are now the norm in embedded systems.
• Introduction to STAR Processor IPs.
• We present an enabled flow for the PAC compiler and toolkits.
• Our compiler can generate efficient code for a set of DSP loop kernels.
