Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
學界開發產業技術計畫經濟部
Compilers and System Software for Embedded DSP Processors
全 程 計 畫:自94年02月01日至97年01月31日止國立清華大學積體電路設計技術研發中心
國立交通大學電子資訊研發中心
李政崑教授
2
嵌入式系統軟體商機
Marketing, Design, Library Control, EDA Tools and
Silicon
Marketing, Design, Library Control,
EDA Tools
Marketing, Design
Marketing
9.4%10,737
74.7%85,726
8.4%9,693
7.5%8,644
Late Adopter
9.8%18,918
45.0%86,868
31.0%59,843
14.2%27,412
Late Adopter
Upper Mainstream
Lower Mainstream
Power Users (Seats)
Upper Mainstream
Lower Mainstream
Power Users (Seats)
Marketing, Design, Library Control, EDA Tools and Silicon and
SoftwareMarketing, Design,
Library Control, EDA Tools and
Silicon
Marketing, Design, Library Control,
EDA Tools
Marketing, Design
Marketing
Semiconductor Design Pyramid, 2002(左圖), 2005(右圖) Source: Forecast: Embedded Software Tools, Worldwide 2004-2009, Gartner (Aug. 2005)
Why will software be a concern tohardware people?
Case Study 1: Low Power Decisions Are Made Very Early in the Design FlowCase Study 1: Low Power Decisions Are Made Very Early Case Study 1: Low Power Decisions Are Made Very Early in the Design Flowin the Design Flow
Power Reduction Percentage
Production
Place & RouteClock-tree, gate-level
RTL SynthesisClock gating, μArch, multi-Vt
ArchitectureMulti-Voltage islands, sleep mode, etc.
System DesignAlgorithms, IP, S/W, etc..
Design
Implementation
Source: Cadence Encounter Low-Power Design Flow Seminar and Workshop
Case Study 2:Audio DSP Addressing
Case Study 2:Case Study 2:Audio DSP Audio DSP AddressingAddressing
I registers• 14-bit addressing• Contain the actual address used to access memoryM registers• 14-bit addressing• Used for post-modify schemeL registers• 14-bit addressing• Provide the modulus logic with the length information
Memory Addressing Space• 14-bit(16k) addressing
24-bit(16M)addressing
• I, M, L registers NOchanges (all 14-bit size)
• Apply two “page”registers for addressing
DMPAGE – 10-bitPMPAGE – 10-bit
We DO NOT know when does the carry happen even if we provide DMPAGE and PMPAGE registers for 10-bit extra addressing
DMPAGE I,M,L register01323
Better Solution for Memory ExtensionBetter Solution for Better Solution for Memory ExtensionMemory Extension
Better solution 1. Hardware automatically complete carry
calculation for “page” registers2. No “page” registers. Instead, extend I,M,L
registers to 24-bit or design I’, M’, L’ to be 24-bit
Case Study 3:Saturation Instruction
Case Study 3:Case Study 3:Saturation InstructionSaturation Instruction
Instruction 1, Saturation-Yes-NoInstruction 2, Saturation-Yes-NoInstruction 3, Saturation-Yes-NoInstruction 4, Saturation-Yes-No
Saturation-onInstruction 1Instruction 2Instruction 3Instruction 4Saturation-off
Code-1Saturation-onInstruction 1Instruction 2Instruction 3Code-2Instruction 4Saturation-off
Compiler Intrinsic FunctionCompiler Intrinsic FunctionCompiler Intrinsic Function
C level functions with several assembly instructions and expanded into compiler intermediate instruction. L_SHR
L_SHL
L_SUB
L_ADD
L_MSU
L_MAC
ROUND
L_MULT
SATURE
Code-1Saturation-onInstruction 1Instruction 2Instruction 3Instruction 4Saturation-offCode-2
Code-1SatureCode-2
ESWESW重點技術項目重點技術項目平台技術與嵌入式軟
體
創新應用系統及軟體
(數位生活)數位家庭應用軟體
娛樂、資訊、控制、服務等應用。
個人可攜式消費性電子產品
通訊、娛樂、資訊等行動裝置應用。
不限定國內外嵌入式系統
AP1
AP2
AP3
AP4
AP5
AP6
AP7
TWMPUs
DSP,Multi-DSP
TW Platform
Compiler RT OS
Embedded Software
Star IP ProgramPerspective: To coordinate resources from the industry, universities, and research organizations in proactive development of architecture and integrated software, building upTaiwan’s industrial leadership in Silicon IP
Basic concepts:
CPU/DSP/MPU
Innovative IPs for multi-band and digitalization are required for future wireless/mobile products
Multimedia IPs
NetworkingInterconnection/Bus/Memories core
StarIP Programs
☆
☆
☆
☆
Key Techniques in Chip Systems
Transmission Links in Chip Systems Ultra Low Power
(ULP) DSP Core
☆
Messenger – Distributed Radio Transmitter System
Low-Power Dual-Processor System
A-Core
SunplusS-Core
STCPAC
ESW 系統及應用參考平台
Multi-Core Micro Kernel
GSM, 3G, WiMAX
Multi-Core,MPU + VLIW DSP
Ethernet, 1394
DMA / Memory Control
Multi-Core interconnection
USB, HDMI
ESL: Virtual Platform Design
Multimedia Accelerator
Applications
Multi-core SoCPlatform
System Kernel Software
Multi-Core IPC
Power Mgmt
DSP Middleware
Streaming Models
DirectFB, OpenGL ES
Network protocol, SIP, RTP
MiddlewareGUI, 3D, JSR-184/239DLNAJVM, OSGi
Offi
ce, P
DF
Med
ia
Play
er
Imag
e Vi
ewer
Secu
rity
SW
DR
M
GPS
Nav
.
Mai
l
Web
B
row
ser
VOIP
, IM
Java
AP,
Jav
a 3D
P2P Stream Server
Web Service
Web Container
Multimedia Portal
Digital Camera
Wireless
Transcoding Biometrics Crypto Engine
Graphics AV CODECs Baseband
Embedded Linux + RTOS
Multi-Core IDE and Debugger
Multi-Core Compiler Toolkits
APIs
學界開發產業技術計畫經濟部
計畫架構
前瞻高效能低耗能之雙處理器系統技術研發計畫
權重: 100%
A. 系統軟體研發與設計(李政崑教授,清大資工)
C.作業系統與應用程式之研發(林大衛教授,交大電子)
B.平行訊號處理器晶片開發與電路設計(黃威教授,交大電子)
(人力、經費)第一年度 29.20% 33.26%第二年度 29.20% 33.26%第三年度 29.20% 33.26%
(人力、經費)第一年度 37.37% 33.17%第二年度 37.37% 33.17%第三年度 37.37% 33.17%
(人力、經費)第一年度 26.78% 16.42%第二年度 26.78% 16.42%第三年度 26.78% 16.42%
1.超長指令集數位訊號處理器編譯器之設計與研發(李政崑教授,清大資工)(黃元欣教授,海大資工)
2. 數位訊號處理器相關函式庫之研發及系統效能評估(石維寬教授,清大資工)(李政崑教授,清大資工)
1. 物件導向執行環境(陳俊穎教授,交大資科)
2. Low Power Logic and Memory Units(黃威教授,交大電子)
1. Extensible and Energy-Aware Micro-architecture(劉志尉教授,交大電子)
3. Interface Circuit Design and Bit- CoprocessingAccelerator(張添烜教授,交大電子)
2. 多媒體演算法之系統分析及設計研發(林大衛教授,交大電子)(賴尚宏教授,清大資工)
4. 雙處理器開發工具整合與測試之系統規劃(李政崑教授, 清大資工)
3. 雙處理器系統展示平台之開發與規劃(分包工研院STC)
(人力、經費)第一年度 6.66% 17.2%第二年度 6.66% 17.2%第三年度 6.66% 17.2%
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
94/2 5 8 11 95/2 5 8 11 96/2 5 8 11
全程計畫產出時程表全程計畫產出時程表計畫產出
Debugger(2.0)
Compiler(3.0)
PACDSP 2.0 PACDSP 3.0
Debugger(3.0)
Microkernel(3.0)
Assembler/linker(2.1)
Assembler/linker(3.0)SIMD/sub-word/GRA
NewLib
Dual-core Programming model
Compiler Optimizations
Microkernel
Compiler Toolchain
Multi-mode On-demand SRAMAsynchronous FIFO in GALS
Microkernel(2.0)
Embedded DRAM
Architecture
Low-Power
Pica 2.0 Pica 3.0 Multi-mode SARM for Virtual Cluster DSP
Multicore DSP Interface
KVM on PAC
GALS-based FFTMultiple Supply Voltage System
Frame-based MPEG4 decoder
JVM with JIT compiler on PAC
Efficient Intra-Prediction
Fast-Intermode Decision(static Learning)
Simple ME Object-based MPEG4 decoderFast Intra-Prediction
Intra-Prediction on PACFast Motion Estimation
Java Environment
Video CodingObject-based MPEG4 encoder
Frame-based MPEG4 encoder
Multi-LevelPower Management
Low Power TCAM(0.26fj/bit/Search) Controllable Dual-PortLow Power SRAM
DFVM for Energy-Aware FFT
Streaming Programming model
GRA/SIMD Compilers
NewLib on PMP SoC
2
ESL Platform
Toolkits for CIC
Controllable SRAMfor Virtual Cluster DSP
Multi-threaded DSP Interface
CVM/KVM/MIDP 2.0 BenchmarkingAOT Compilation Prototype
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發 15
2007.3Enable PAC DSP System Newlib
2007.3Enable PAC DSP System Newlib
PAC3.0
PAC3.0
PAC2.1
PAC2.12005.11
PAC DSP 2.1 Assembler/Linker Available2005.11
PAC DSP 2.1 Assembler/Linker Available 2005.12PAC DSP 3.0 Document Available2005.12PAC DSP 3.0 Document Available
2005.9PAC DSP 2.1 Document Available2005.9PAC DSP 2.1 Document Available
2005.9PAC DSP 2.0 Debugger Available
2005.9PAC DSP 2.0 Debugger Available
2005.10PAC DSP 2.0 Microkernel Available
2005.10PAC DSP 2.0 Microkernel Available
PAC2.0
PAC2.0
PAC3.1
PAC3.1
2006.2 PAC DSP 3.0 ISS Available2006.2 PAC DSP 3.0 ISS Available2006.3PAC DSP 2.0 Real Chip Demo2006.3PAC DSP 2.0 Real Chip Demo2006.4
PAC DSP 3.0 Compiler Available2006.4
PAC DSP 3.0 Compiler Available
2006.5PAC DSP 3.0 RTL Freeze / Tape-out2006.5PAC DSP 3.0 RTL Freeze / Tape-out
2006.7PAC DSP 3.0 Microkernel Available
2006.7PAC DSP 3.0 Microkernel Available
2006.10PAC DSP 3.0 Real Chip Testing Passed2006.10PAC DSP 3.0 Real Chip Testing Passed
2006.2PAC DSP 3.0 Assembler/Linker Available
2006.2PAC DSP 3.0 Assembler/Linker Available
2006.3PAC DSP 3.0 Debugger Available
2006.3PAC DSP 3.0 Debugger Available
2006.5PAC DSP 3.0 Software/Hardware Testing
Criteria Passed
2006.5PAC DSP 3.0 Software/Hardware Testing
Criteria Passed
2007.1PAC DSP 3.0/3.1 Optimizing Compiler Available
2007.1PAC DSP 3.0/3.1 Optimizing Compiler Available
2007.1PAC PMP SOC Tape-outARM+DSP H.264 VGA Realtime Demo Passed
2007.1PAC PMP SOC Tape-outARM+DSP H.264 VGA Realtime Demo Passed
2007.4SIMD with Sub-word Instructions/Global RFA
2007.4SIMD with Sub-word Instructions/Global RFA
2007.6Programming Model Design for PAC II
2007.6Programming Model Design for PAC II
Tool Chain: Compiler & MicroTool Chain: Compiler & Micro--KernelKernel
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發 16
2007.3Enable PAC DSP System Newlib
2007.3Enable PAC DSP System Newlib
PAC3.1
PAC3.1
2007.1PAC DSP 3.0/3.1 Optimizing Compiler Available
2007.1PAC DSP 3.0/3.1 Optimizing Compiler Available
2007.1PAC PMP SOC Tape-outARM+DSP H.264 VGA Realtime Demo Passed
2007.1PAC PMP SOC Tape-outARM+DSP H.264 VGA Realtime Demo Passed
2007.4SIMD with Sub-word Instructions/Global RFA
2007.4SIMD with Sub-word Instructions/Global RFA
2007.6Streaming RPC Programming Model Design for PAC II
2007.6Streaming RPC Programming Model Design for PAC II
Tool Chain: Compiler & MicroTool Chain: Compiler & Micro--Kernel Kernel (Continue(Continue--2008)2008)
2008. 01PAC DSP Toolkits Delivered for CIC
2008. 01PAC DSP Toolkits Delivered for CIC
2007.11PAC DSP Debugger on PAC PMP SoC Board
2007.11PAC DSP Debugger on PAC PMP SoC Board
2007.10PAC PMP SoC Board Available2007.10PAC PMP SoC Board Available
2007.9Pass EEMBC Telecom Test-suite
2007.9Pass EEMBC Telecom Test-suite
2007.10Pass Major MiBench Test-suite
2007.10Pass Major MiBench Test-suite
2008. 02Enable PAC DSP newlib support PAC PMP SoC Board
2008. 02Enable PAC DSP newlib support PAC PMP SoC Board
2008.3PACDSP+AndeSCore Dual-Core ESL platform
2008.3PACDSP+AndeSCore Dual-Core ESL platform
2007.8Communication APIs
2007.8Communication APIs
2008. 03Multi-core communication API runtime
2008. 03Multi-core communication API runtime
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Development Software
Middleware
Dual-Core OS/Architecture/ESL
科專產出科專產出: : 整合多媒體雙核心應用平台整合多媒體雙核心應用平台
AV.CodecsJava VM
pCoreLinux
RPC
Streaming PRC
DSP Lib.
OpenMAX IL
ESL: Virtual PlatformMPU+PACDSP
OpenMAX DLCompiler Toolkits
ApplicationGUI
SIP Phone Image ViewerMedia PlayerJava AP
Web Browser
透過網路達成VoIP的電話功能
提供PACDSP H.264及MPEG4 Codecs
提供網路瀏覽應用
相片瀏覽功能
影像播放功能
提供雙核心彼此間的呼叫與資料傳遞
提供NewLib,作為C語言函式庫
提供完整的軟體開發環境(IDE, Compiler,
Debugger, Assembler, Binutilus, …)
提供Linux 2.6的作業系統環境
雙核心高效能硬體平台(A3, STC技術支援)
系統層級的模擬環境(衍生計畫產出)
提供OpenMax IL層的多媒體呼叫,以及
OpenMax DL層的函式庫設計
可執行Java應用程式,提供Java VM
提供Streaming方式的雙核心RPC資料傳遞
藉由pCore(DSP微核心),提供多重函式呼
叫與工作切換
PACDSP Kernel
From ITRI STC
Clustered Architecture
Scalable Clustered ArchitectureScalable computation power fits the different application requirement
Distributed Register FileDistributed register file reduce the area and power consumption
Cluster From ITRI STC
Distributed Register File- A, PP,C and AC Register File
Ping-pong Register FileReduce Move OperationReduce the port number in each register file
Constant Register FileReduce Load/Store OperationReduce the power consumption in coefficient access
From ITRI STC
Clustered Data Path and Distributed Register File- Evaluation Result
Compare with Centralized Register File
Area : 76.8% area are saved Access Time : 46.9% access time are saved
From ITRI STC
DSP Compilers
• Heterogeneous Registers• Addressing Modes• Limited Connectivities• Local Memories• Harvard Architecture• Sub-word/SIMD Performances
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Key Compiler TechnologiesVLIW DSP compilers for distributed Register FilesPALF scheduling policies for ILP (CPC 2006)SA-based scheme for ILP (LCPC 2005)GRA scheme for distributed register files (CPC 2007)Copy propagations for distributed register architectures (LCPC 2006)SWP flow for distributed register file architectures (ACM LCTES 2007)SIMD compiler optimizationsEnable compiler + intrinsics/extrinsicsCompiler for low-power schemes (ACM EMSOFT 2005/ACM TODAES 2007)
Irregular Register File Usage
Local RFAccessible Only by Dedicated FU
Global RFShared by M-I PairAccess Limited byPing-pong Switches
Constant RFWritable only by M-Unit1 Read-only Access Per UnitOnly Usable by Some Instructions
M-Unit
A Registers
M-Unit
A Registers
I-Unit
AC Registers
I-Unit
AC Registers
B-Unit
R Registers
M-Unit M-Unit
I-Unit I-Unit
D Registers D Registers
M-Unit M-Unit
I-Unit I-Unit
CRegisters
CRegisters
Maximal 2 Read Ports + 1 Write PortLow cost and low power
(Compared to TI C6: 10 Read Ports + 6 Write Ports)
1. Register Allocation for PAC DSP
How to adapt register allocation to the features of irregular register files?
Various RFs for each FUPing-pong Bank SwitchesComplex Register File Communication
Our proposed solutions:SA-RA: A simulated-annealing approach (LCPC 2005)
Random based nondeterministic iterative searchPALF-RA: A more direct and faster heuristic approach
PALF Register Allocation Scheme (CPC 2006)
Solution 1:Simulated-Annealing (SA) Based Register Allocation Approach
Motivation:Complex interference from:
We appreciate a machine-learning method to give a near-optimal results.To be a base reference for developing a faster heuristic method!
Register Allocation
InstructionScheduling
Code Insertionfor Distributed Register
Communication
To Determine: Virtual Register Register File (Bank)
Input: un-scheduled instructionsOutput: a schedule of the instructions
a register file assignment (RFA) map
RFA map = {(v1, f1), (v2, f2), ...} Where vi : a virtual register, fi : a register file (bank)
• Setup SA:1. An initial random RFA map2. sched_len = PAC_Scheduler ( initial RFA map )3. SA control variables:
• threshold• p_test: a probability test value (0 < p_test < 1).• energy: initial value > threshold.
PAC_Scheduler:Graph-coloring based register allocation according to the RFA mapInstruction scheduling and code insertion for register file communication
To Optimize: Scheduling Result
Randomly change:a mapping (vi, fi)
Re-run:new_schedule_len =
PAC_Scheduler (new RFA map)
new RFA map
Better result test:new_schedule_len < schedule_len
energy--schedule_len =
new_schedule_len
Random test:a random number > p_test
energy++
yes
yes
no
no
new RF
A map
old RFA map
SA stop test:energy > threshold
yes
FinalRFA map
&schedule
no
Solution 2:Concepts for PALF Register Allocation Scheme
Which RF to be allocated?
How to utilize ping-pong banks?
How to utilize clusters?
LessUsability
More Scheduling Interference
LocalRF
ConstantRF
GlobalRF
HigherPriority
To maximize parallelism between banks! Hard to determine without scheduling
If the results are not worse than using global RF,why not use local RF instead?
M-Unit I-UnitOnly allocate Global RF for:
To partition operations into two clusters if it worth!
PALF Register Allocation Scheme
MaximalLocalization
Register FileAssignment
Ping-pongRegister BankAssignment
ClusterAssignment
CommunicationCode Insertion
Post-passRegister
Allocation
BuildCRTA-DDG
2-Cluster Code?Yes
No
Ping-pong Aware Local Favorable (PALF)To allocate local RF firstTo assign ping-pong banks for minimizing interference
PALF determines RF allocation, then applies usual register allocation to each RF
Experiment Platform
EnvironmentPACDSP Compiler (using ORC infrastructure)PACDSP binutils(modified GNU binutils)Instruction Set Simulator
Cycle accurate
BenchmarkDSP-stone
System-SoftwareDevelopment Suite
Profiler
Debugger
Libraries
Assembler Linker
C/C++ Compiler
InstructionSet
Simulator
DSPstone Testing Patterns -Speedup
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
2. Enable GRA (Global Register File Assignment) 2. Enable GRA (Global Register File Assignment) for PAC Architectures for PAC Architectures
•• Extends Local (single block) RFA to Global Extends Local (single block) RFA to Global RFARFA
•• Block prioritization for Global RFABlock prioritization for Global RFA–– Based on loop depth, scheduling length, Based on loop depth, scheduling length,
frequency (static or profiled), etc.frequency (static or profiled), etc.–– Prioritizing more Prioritizing more ““importantimportant”” blocks that have blocks that have
greater effects on performancegreater effects on performance–– Try to minimize modification to the Try to minimize modification to the RFAsRFAs of the of the
blocks with higher priority. That is, keep them blocks with higher priority. That is, keep them untouched as possibleuntouched as possible
•• A LocalA Local--Conscious Global Register Conscious Global Register AllocatorAllocator for VLIW for VLIW DSP Processors with Distributed Register Files, DSP Processors with Distributed Register Files, ChiaChia Han Han Lu, YoungLu, Young--JiaJia Lin, YiLin, Yi--Ping You, Ping You, JenqJenq--KuenKuen Lee, CPC Lee, CPC 2007 (Compilers for Parallel Computing 2007), Lisbon, 2007 (Compilers for Parallel Computing 2007), Lisbon, Portugal, July 9Portugal, July 9--11, 200711, 2007
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Example of Global RFAExample of Global RFA
BB1
BB2
BB4
BB3
BB5 BB6
BB7
BB8
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Experimental Result of Global RFAExperimental Result of Global RFA
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
PACDSP SubPACDSP Sub--word Instructionsword Instructions
• PAC DSP Sub-word instructions– Three execution modes
• Single: ADD Rd, Rs1, Rs2 (one 32-bit)• Dual: ADD.D Rd, Rs1, Rs2 (two 16-bit)• Quad: ADD.Q Rd, Rs1, Rs2 (four 8-bit)
– Sub-word calculation (3 modes)
32 bit32 bit
32 bit32 bit
32 bit32 bit
+
=
16 bit16 bit+
=
16 bit16 bit
16 bit16 bit 16 bit16 bit
16 bit16 bit 16 bit16 bit
88 88 88 88++
=88 88 88 88
+ + +
= = = =
88 88 88 88Single Dual Quad
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Performance for SIMD on PAC CompilerPerformance for SIMD on PAC Compiler
0
0.5
1
1.5
2
2.5
bkfir lms ssfir vecadd vecdot
Spe
edup
O0+ILO0+SIMDO1+ILO1+SIMDO2+ILO2+SIMD
0
2
4
6
8
10
12
14
16
18
bkfir lms ssfir vecadd vecdot
Spee
dup O0
O1+ILO2+ILO2+SIMD
PS. IL means IPA and LNO optimization
Fig. Improvem
ent for each optim
ization levelFig. O
verall Performance
Dspstone performance
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Compiler Performance with G.723.1a Application
G.723.1a Decoder
0
2
4
6
8
10
12
TI O0 TI O1 TI O2 TI O3 PACC O0 PACC O1 PACC O2 PACCO2+IPA+WOPT
PACC withExtrinsic
Cyc
le N
orm
aliz
ed to
TI O
3
G.723.1a Encoder
0
2
4
6
8
10
TI O0 TI O1 TI O2 TI O3 PACC O0 PACC O1 PACC O2 PACCO2+IPA+WOPT
PACC withExtrinsic
Cyc
le N
orm
aliz
ed to
TI O
3
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Compiler with SIMD Intrinsic/Extrinsic Support
Extrinsic
Register pair (64-bit)SWAP4E
Fix-pointSWAP4
SaturatedUNPACK4(U)
Intrinsic supportedRORPACK4
ROLUNPACK2(U)
INSERTISWAP2
L_SHRINSERTPERMH4
L_SHLXDOTP2EXTRACTI(U)ADDC(U)PERMH2
L_SUBSAA.QDOTP2EXTRACT(U)ABS.(D/Q)LIMB(U)CP
L_ADDRNDXMSU.DSRAI.(D)NEGLIMHW(U)CP
L_MSUBF.(D)XMAC.DSRA.(D)MERGESLIMW(U)CP
L_MACSFRA.(
D)XFMAC.(D)XMUL.DSRLI.(D)MERGEACOPY
ROUNDLMBDMSU.DXFMUL.(D)SRL.(D)SUB(U).(D/Q)MOVI(U).H
L_MULTCLSMAC.DMUL.DSLLI.(D)ADDI.(D)(D)MAX(U).(D/Q)MOVI(U)
SATURE(D)CLRFMAC(uu/us/su).(D)FMUL(uu/su/us).(D)SLL.(D)ADD(U).(D/Q)(D)MIN(U).(D/Q)MOVI.L
ExtrinsicSpecialMultiply & AddMultiplicationBit
ManipulationArithmeticComparisonData
Transfer
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
MiBenchMiBench Performance for PAC CompilerPerformance for PAC Compiler
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
EEMBC PerformanceEEMBC Performance
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
EEMBC PerformanceEEMBC Performance
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Compiler for Low-Power with Power Gating Instruction
0%
5%
10%
15%
20%
25%
30%
35%
Code
Siz
e G
rout
h
complex_m
ultiply
complex_u
pdateconv
olution
dot_produc
tfir2di
m fir
irr_biquad
_N_sectio
ns
irr_biquad
_one_sec
tion lmsmatri
x 1x3 matri
x
n_comple
x_updates
n_real_up
datesreal_
updateavera
ge
CADFE CADFE w/Sink-N-Hoist
To turn off useless components in processors. Use compiler analysis techniques to analyze program behaviorsCompilers insert power-gating instructions and try to merge those instructions.
( )
( )
( )
( )
( ) ( )
( ) ( ) ( ) ( )
( ) ( )
( ) ( ) ( ) ( )
in outp Pred b
out loc in blk
in outp Pred b
out loc in blk
b p
b b b b
b p
b b b b
∈
∈
=
= ∪ −
=
= ∪ −
U
I
COMPONENT COMPONENT
COMPONENT COMPONENT COMPONENT COMPONENT
SINKABLE SINKABLE
SINKABLE SINKABLE SINKABLE SINKABLE
HO
( )
( ){ }( )
( )
( )
( )
( ) ( )
( ) ( ) ( ) ( )
( ( ))( )
, if ( ( ))
in outs Succ b
out loc in blk
p Pred b outin
p Pred b out
out
b s
b b b b
pb
p
∈
∈
∈
=
= ∪ −
⎧ Φ⎪=⎨∅ Φ =∞⎪⎩
IISTABLE HOISTABLE
HOISTABLE HOISTABLE HOISTABLE HOISTABLE
GROUP OFFGROUP OFF
GROUP OFF
GROUP OFF
--
-
-
MIN
MIN
( )( ){ }
( )( )
( )
( )
( ) ( ) ( ) ( )
( ( ))( )
, if ( ( ))
( ) ( ) ( ) ( )
loc in blk
p Pred b outin
p Pred b out
out loc in blk
b b b b
pb
p
b b b b
∈
∈
= ∪ −
⎧ Φ⎪=⎨∅ Φ =∞⎪⎩
= ∪ −
GROUP OFF GROUP OFF GROUP OFF
GROUP ONGROUP ON
GROUP ON
GROUP ON GROUP ON GROUP ON GROUP ON
- - -
--
-
- - - -
MIN
MIN
Based on our work inACM TODAES 2006 & ACM TODAES 2007
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
DualDual--Core/MultiCore/Multi--Core Compiler Core Compiler with Language Supportwith Language Support
Fully control of dedicated Fully control of dedicated processors(sayprocessors(say, PAC DSP), PAC DSP)
Boot & initialize DSPBoot & initialize DSPBased on Based on pCorepCore, multi, multi--tasking tasking environmentenvironmentHighly efficient data exchange Highly efficient data exchange between ARM and DSPbetween ARM and DSP
Three layers frameworkThree layers frameworkLayer 0: Communication Layer 0: Communication LibraryLibrary
APIs for C programsAPIs for C programsConventional C library Conventional C library interfacesinterfaces
Layer 1: Layer 1: RemotingRemoting/RPC/RPCMinor CMinor C--extension languageextension languageEndEnd--toto--end service end service
Layer 2: Data flow control typeLayer 2: Data flow control type
44
• Software Architecture Design for Streaming Java RMI, C. C. Yang, Jenq Kuen Lee, et al, accepted, Science of Computer Programming.
• Streaming Support for Java RMI in Distributed Environment, C. C. Yang, Jenq-Kuen Lee, et al, ACM Principles and Practices of Programming In Java (PPPJ 2006), 2006.
• Efficient Switching Supports of Distributed .NET Remoting with Network Processors, Jenq-Kuen Lee, et al, ICPP 2005.
• Support and optimization of Java RMI over Bluetooth environments, P. C. Wey, JenqKuen Lee, et al, Concurrency and Computation:Practice and Experience, 2005;17:967-989, Wiley, 2005.
• Efficient Support of Java RMI over Heterogeneous Wireless Networks, Jenq-KuenLee, et al, ICC, Paris, June 2004.
• Specification and Architecture Supports for Component Adaptations on Distributed Environments, Chung-Kai Chen, Jenq KuenLee, et al, IPDPS 2004.
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
RPCRPC
Application
Communication runtimeRemote services invocation
Streaming servicesStreaming services
Operating systems pCorepCore
Architecture supported communication mechanisms
client server
VICVIC MailboxMailbox Shared Shared MemoryMemory DMADMA
Communication runtime
RegistryRegistryRegistry
Communication runtime
Model the communication as end-to-end services Server: the remote processClient: the task that invokes a remote process
Remote invocation is like a local function invocationClient and server communicates through communicator The communication information is stored in a shared data structure called registry residentServices are off-loaded to DSPs by service processors
DDualual--Core Programming with Core Programming with Streaming RPCStreaming RPC
Key componentsKey componentsStreaming line: associated to an RPC request, provides a Streaming line: associated to an RPC request, provides a communication channel between the sender and the receivercommunication channel between the sender and the receiverStreaming buffer: associated to a streaming line for providing dStreaming buffer: associated to a streaming line for providing data ata bufferingbufferingStream controller: monitoring and managing the streaming buffersStream controller: monitoring and managing the streaming buffersfor supporting datafor supporting data--driven and optimization features driven and optimization features
Application
Communication runtime
Remote services invocationStreaming servicesStreaming services
Operating systems
pCorepCore
Architecture supported communication mechanisms
client server
VICVIC MailboxMailbox Shared Shared MemoryMemory DMADMA
Communication runtime
Streaming lineStreaming lineStreaming buffer
Stream ControllerStream Stream ControllerController
Demo Environment :PAC EVBARM Versatile Platform PB926
Platform 64MB Intel NOR Flash128MB 32-bit SDRAMEthernetVGA monitor output2 AHB-m , 1 AHB-s bridge
ARM 926EJ-S300M Hz0.18μ32KB cache
SoftwareLinux 2.6.17IPC Kernel Module
EVB (PAC DSP)Platform
512MB SDRAM1MB ARAM *2 , 1MB block RAM *2Audio out, video out
PAC DSP 3.0250 MHZ64KB data ram32KB instruction cache
SoftwarepCore v3.0IPC library
ARM VersatileARM Versatile
EVBEVB
pCore - PACDSP™ 微核心Kernel Feature
Static memory allocationpCore supports configurable kernel
Task ManagementpCore supports at most 16 tasksPriority based schedulerConstant time scheduler
Inter-Process Communication
pCore supports at most 4 mailboxes
SynchronizationpCore supports at most 4 semaphores
Dual-Core ProgrammingpCore supports APIs to help programmer to communicate with MPU(ARM).
2008/06/24 48
-41Pipe write
-41Pipe read
-21Pipe initialization
234284Mailbox pend (with context switching)
-70Mailbox pend ( no context switching)
218268Mailbox post (with context switching)
-54Mailbox post( no context switching)
220270Semaphore pend (with context switching)
-56Semaphore pend ( no context switching)
220270Semaphore post (with context switching)
-56Semaphore post( no context switching)
198249Task change priority
205254Task suspend (with context switching)
-91Task suspend ( no context switching)
-113Task delete
280321Task create (with context switching )
-147Task create (no context switching )
With MCSS*
No MCSS
Kernel service /cycle counts
* K.-Y. Hsieh, Y.-C. Lin, and J. K. Lee, “Enhancing microkernel performance on VLIW DSP processors via multiset context switch,” Journal of VLSI Signal Processing Systems, 2007.
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
ApplicationsApplicationswith Streaming RPCwith Streaming RPC
FeaturesFeaturesSupport RPCSupport RPC--level level abstractionabstractionDesign patterns for Design patterns for overlapping overlapping communication and communication and computationcomputationThreshold Threshold parameter for parameter for reducing amount of reducing amount of handhand--shakingshakingSupport Linux kernel Support Linux kernel 2.6.x 2.6.x Support two platformSupport two platform
PACPAC--EVBEVBTI OMAP 5912TI OMAP 5912
/ *Streaming RPC server* /void imdct ( ){
STREAM_ID id = 4 ;/* Initializing streaming channel* /stream_create( id ) ;/* Aggregating
from streaming channel*/stream_get( id , DATA) ;steram_pop( id ) ;…}
/ *Streaming RPC server* /void imdct ( ){
STREAM_ID id = 4 ;/* Initializing streaming channel* /stream_create( id ) ;/* Aggregating
from streaming channel*/stream_get( id , DATA) ;steram_pop( id ) ;…}
Example: Sample code of the streaming RPC implementation of an MP3 decoder
/* Streaming RPC client*/void MP3 decoder ( ){stream_rpc( imdct , transmitter) ;
}void transmitter( ){
STREAM_ID id = 4 ;/* Initializing streaming channel* / stream_create( id ) ;/* Pushing data
to streaming channel */stream_put( id , DATA) ;stream_push ( id ) ;…
}
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Performance EvaluationPerformance EvaluationApplicationsApplications
JPEG decoder, resolution JPEG decoder, resolution of 317*255of 317*255MP3 decoder, 128k bitMP3 decoder, 128k bit--rate, 44100 sample raterate, 44100 sample rateQCIF H.264 decoderQCIF H.264 decoder
Streaming RPC improves Streaming RPC improves the average performance the average performance of the decoder by of the decoder by 30%30%
Application kernelsApplication kernelsIDCT of JPEG decoderIDCT of JPEG decoderIMDCT of MP3 decoderIMDCT of MP3 decoderIT/IQ of H.264 decoderIT/IQ of H.264 decoder
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
Dual-Core Toolkits整合DemoDual-core Multimedia Mobile Phone
Three multimedia decoder : JPEG, MP3, H.264One Simulated telecom decoding program (SIP)
Dual-OS environment with communication libraryMultiple tasks on DSP scheduled by pCore with priority-based policyWhile user making a phone call:
PAC DSP starts processing telecom decoding with highest priorityMedia decoding task is suspended and resumed at the end of phone calls
media thread
phone thread
media task
phone task
idle taskprogram startinitialize tasks Start media make a phone call media contiunes
AR
MD
SP
pllab.cs.nthu.edu.twpllab.cs.nthu.edu.twpllab.cs.nthu.edu.tw
經濟部學界開發產業技術計畫
雙核心嵌入式處理器架構技術虛擬平台-An Design Experience on
AndESLive
總計畫主持人 李政崑教授國立清華大學積體電路設計技術研發中心
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發Binutils
Compiler
GDB
Libraries
PACDSP + Andes Toolchain
Andes+PAC:雙核心ESL模擬環境
完成工作完成工作
ESLESL模擬:模擬:完成完成PACDSP v3.0 PACDSP v3.0 sidsid IPIPPACDSP IPPACDSP IP與與AndESLiveAndESLive IPIP整合整合Share memoryShare memory及及interruptinterrupt溝通溝通
開發工具:開發工具:
PACDSP Compiler PACDSP Compiler PACDSP PACDSP BinutilsBinutils
進行工作進行工作
作業系統:作業系統:
Dual core OS: Linux 2.6 Dual core OS: Linux 2.6 kernel(Andeskernel(Andes) and PACDSP ) and PACDSP pCore(NTHUpCore(NTHU) )
pllab.cs.nthu.edu.twpllab.cs.nthu.edu.tw
AndesSCore™ + PACDSP Co-Simulation
pllab.cs.nthu.edu.twpllab.cs.nthu.edu.twDual Core – JPEG Demo
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
獲獎紀錄獲獎紀錄
李政崑教授PACDSP Compiler教育部嵌入式軟體競賽優勝2008/7
劉志尉教授The third place award for the Soft-IPUn-assign Topic in The Silicon Intellectual Property (SIP) Design Contest
教育部 SIP競賽 不定題組 佳作2006
項目 獲獎題目 指導老師
2006/1 ASPDAC Outstanding Design Award 52mW 1200 MIPS Compact DSP for Multi-core Media SoC
劉志尉教授
2006/11 Workshop on Consumer Electronics and Signal Processing 學生論文獎
Software implementation of MPEG-4 video decoder on the PACDSP platform
林大衛教授
2006/12 中華民國資訊學會碩博士最佳論文獎碩士論文獎佳作
降低漏電電流暨增進效能之集合關聯式快取記憶體李嘉哲
黃元欣教授
2006/12 中華民國資訊學會碩博士最佳論文獎碩士論文獎佳作
疊加網路上軟體元件遠端呼叫之資料串流支援楊智傑
李政崑教授
2007/7 旺宏金矽獎優勝 雙核心系統整合開發工具 李政崑教授
2007/7 教育部嵌入式軟體競賽優勝 PACDSP BIOS之設計與驗證 李政崑教授
2007/11 National Symposium on Telecommunications學生論文優等獎
Software implementation of MPEG-4 object-based video decoder on a dual-core digital signal processor platform
林大衛教授
2007/12/22 微電腦應用系統設計製作競賽 - 嵌入式軟體系統類 研究所組 第三名
Fast Motion Estimation in H.264 for PAC DSP 賴尚宏教授
2008/1 中華民國資訊學會碩博士最佳論文獎博士論文獎優等
低功率嵌入式處理器之編譯器最佳化研究游逸平
李政崑教授
2008/1 中華民國資訊學會碩博士最佳論文獎碩士論文獎優等
支援數位訊號處理器之微核心設計及雙核心開發環境黃建今
李政崑教授
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
CIC Distribution:PACDSP處理器開發工具整合套件
與STC聯合提供國家晶片系統設計中心(CIC) PACDSP 處理器開發工具整合套件
簽約日期:97.03.10簽約金額:2,000,000(15% income for NTHU MOEA Team)
項目DSP tool chain on PMP EVBC-library for PMP EVBIntegrated development environmentPMP Spec /課程大綱Training course /LAB軟體使用環境說明書週邊硬體使用者手冊DSP programmingPMP EVB overview
前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發
SunPlus SPCT6100
ESW平台開發機制
Platform技術處
Education教育部
Research國科會
Development工業局 基於台灣自有之處理
器核心建構自有平台與軟體之核心技術
培育自有平台與軟體之應用與研發人才
推動平台與軟體商品化之多元應用
透過開放軟體延伸平台之多元應用並研發性能提升之前瞻技術
國科會自由軟體與嵌入式系統學術研發應用科技計畫
計畫推動策略重視研發品質管理
軟體品質管理流程研討、(需求分析、系統設計、系統測試)三階段文件評鑑、成果公開評鑑
加強法人研究機構、公協會及產業界的參與及合作-黃金企鵝獎
強化學術團隊與自由軟體社群的交流
多元方式獎勵成果推廣與績優團隊
提供訓練課程、技術支援與諮詢
鼓勵使用嵌入式平台進行發展、研究以及嵌入式系統競賽之系統發展
推動共用平台開發產業聯盟(TEIA)
成立目的:推廣共用平台至「產、學、研」單位不定期舉行開發者技術分享研討會提供平台開發者交流網站
成立步驟邀請共用平台廠商制定聯盟章程廣邀國內嵌入式軟硬體廠商加入
規模負責聯盟事務
推動共用平台開發者產業鏈(建立eco-system)
系統整合開發
嵌入式軟體
嵌入式硬體IC設計
IC設計服務SIP供應商
嵌入式OS/軟體
設計服務
工業控制
網路
醫療
消費性電子
PC/PC周邊手機
汽車
國內知名廠商 國外知名廠商
威盛、凌陽、瑞昱、聯發科, etc
創意, 智原
晶心
智崴,系微, 滾雷
環隆電氣, 新華
威達電, 研華科技
合勤,零壹, 友訊
Agilent
MSI, Gigabyte
Acer, Asus
宏達電,啟基
裕隆, 中華
Qualcomm, Boradcom
Wipro
ARM, MIPS
Microsoft, Wind River, VenturCom, Quadros, QNX Software Sys,
Palm
Flextronics, Solectron
Kontron
3 com, CISCO
Omron
Sony, Apple
HP, DELL
Nokia, Motorola
Toyota, Benz, BMW
SummarySystem software will be playing significant roles with IC industries in Taiwan.Dual-core and multi-core solutions are now the norm in embedded systems.Introduction to STAR Processor IPs.We present an enabled flow for performing PAC Compiler and Toolkits.Our compiler could generate efficient codes for a set of DSP loop kernels