1
The Role Of ASIP In Programmable Platforms
2
Outline
Using ASIP – a new design paradigm
EEMBC – a case study
Designing ASIP using Xtensa and TIE
Addressing the needs of platforms
ASIP computing capabilities
ASIP communication capabilities
Challenges
3
A short story of a design paradigm shift
4
Once upon a time
How do I solve the encryption problem?
5
Data Encryption Standard (DES)
Initial step
    (R, L) = Initial_permutation(Din64)
Iterate 16 times
    Key generation
        (C, D) = PC1(k)
        n = rotate_amount (function of iteration count)
        C = rotate_right(C, n)
        D = rotate_right(D, n)
        K = PC2(D, C)
    Encryption
        R[i+1] = L[i] XOR Permutation(S_Box(K XOR Expansion(R[i])))
        L[i+1] = R[i]
Final step
    Dout64 = Final_permutation(L, R)
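The round structure on this slide is a Feistel network: each round swaps the halves and mixes one half into the other through a round function. The following is a minimal sketch of that structure only. The round function `toy_round` is a hypothetical stand-in and is not the DES Permutation/S_Box/Expansion chain; the names `feistel_round`, `feistel_encrypt`, and `feistel_decrypt` are likewise assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for Permutation(S_Box(K XOR Expansion(R))).
   This is NOT the DES round function; it only makes the sketch runnable. */
static uint32_t toy_round(uint32_t r, uint32_t k)
{
    uint32_t x = r ^ k;
    return ((x << 3) | (x >> 29)) + 0x9E3779B9u;
}

/* One Feistel round: R[i+1] = L[i] XOR f(R[i], K), L[i+1] = R[i]. */
static void feistel_round(uint32_t *l, uint32_t *r, uint32_t k)
{
    uint32_t next_r = *l ^ toy_round(*r, k);
    *l = *r;
    *r = next_r;
}

/* Encrypt: 16 rounds, round keys used in order. */
static void feistel_encrypt(uint32_t *l, uint32_t *r, const uint32_t keys[16])
{
    for (int i = 0; i < 16; i++)
        feistel_round(l, r, keys[i]);
}

/* Decrypt: swap halves, run the same rounds with keys reversed, swap back. */
static void feistel_decrypt(uint32_t *l, uint32_t *r, const uint32_t keys[16])
{
    uint32_t t = *l; *l = *r; *r = t;
    for (int i = 15; i >= 0; i--)
        feistel_round(l, r, keys[i]);
    t = *l; *l = *r; *r = t;
}
```

Because the same round is reused for both directions, only the key order changes between encryption and decryption, which is why one datapath can serve both.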
6
The SW engineer very proudly presented
static unsigned permute(unsigned char *table, int n, unsigned hi, unsigned lo)
{
    int ib, ob;
    unsigned out = 0;

    /* table[ob] holds the 1-based source bit for output bit ob */
    for (ob = 0; ob < n; ob++) {
        ib = table[ob] - 1;
        if (ib >= 32) {
            /* source bit lives in the high word */
            if (hi & (1u << (ib - 32)))
                out |= 1u << ob;
        } else {
            /* source bit lives in the low word */
            if (lo & (1u << ib))
                out |= 1u << ob;
        }
    }
    return out;
}
This code is fast
7
The HW engineer laughed
[The DES steps from the "Data Encryption Standard (DES)" slide are shown again.]
200 cycles? I can do it in 1!!!
8
The HW engineer presented
[Block diagram: Initial Permutation, Expansion Permutation, S Boxes, P Permutation, Final Permutation, Key Generation, State Machine]
I’ll show you how fast it can be
9
The SW engineer laughed
[Block diagram: Initial Permutation, Expansion Permutation, S Boxes, P Permutation, Final Permutation, Key Generation, State Machine]
I can change this in 1 minute, can you?
10
Realizing that they each had something the other wanted
If only I didn’t have to design the controller
If only I had just the instruction I need
11
They decided to work together
GETDATA ars, hilo
DES immediate
SETDATA ars, art
SETKEY ars, art
[Block diagram: Initial Permutation, Expansion Permutation, S Boxes, P Permutation, Final Permutation, Key Generation, State Machine]
12
and improved the SW solution by 70x
Decryption
SETKEY(K_hi, K_lo);
for (;;) {
    … /* read encrypted data */
    SETDATA(D_hi, D_lo);
    DES(DECRYPT1); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1); DES(DECRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    … /* write data */
}
Encryption
SETKEY(K_hi, K_lo);
for (;;) {
    … /* read data */
    SETDATA(D_hi, D_lo);
    DES(ENCRYPT1); DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    … /* write encrypted data */
}
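The ENCRYPT1/ENCRYPT2 pattern in the encryption loop lines up with the standard DES key-schedule rotation amounts (1 or 2 bit positions per round), so the immediate presumably selects the rotate amount; that reading is an assumption, not something the slide states. The sketch below models only the schedule of rotations on one 28-bit key half. The names `des_shifts`, `rotl28`, `rotate_all_rounds`, and `total_shift` are hypothetical, and the rotation direction used here (left) follows the published standard rather than the `rotate_right` wording of the earlier slide.

```c
#include <stdint.h>

/* Standard DES key-schedule rotation amounts for rounds 1..16. */
static const int des_shifts[16] = {
    1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1
};

/* Rotate a 28-bit key half left by n bits, 0 < n < 28. */
static uint32_t rotl28(uint32_t half, int n)
{
    half &= 0x0FFFFFFFu;
    return ((half << n) | (half >> (28 - n))) & 0x0FFFFFFFu;
}

/* Apply the rotations of all 16 rounds to one key half. */
static uint32_t rotate_all_rounds(uint32_t half)
{
    for (int i = 0; i < 16; i++)
        half = rotl28(half, des_shifts[i]);
    return half;
}

/* Sum of the per-round rotation amounts. */
static int total_shift(void)
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += des_shifts[i];
    return sum;
}
```

The amounts add up to 28, the width of a key half, so after 16 rounds the key registers are back at their starting value. That would let the loop process the next block without reloading the key, which is consistent with `SETKEY` sitting outside the `for` loop.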
13
When the boss asked how, the SW engineer said:
[Diagram: SW Solution, made of Registers, Datapath, Control, and Memory (Program). A chart with axes Correct and Efficient marks the SW solution.]
14
and the HW engineer said:
[Diagram: HW Solution, made of FSM and Storage. A chart with axes Correct and Efficient marks the HW solution.]
15
Together, they had the best of both worlds
[Diagram: ASIP, combining the SW Solution (Registers, Datapath, Control, Memory (Program)) and the HW Solution (FSM, Storage). The Correct/Efficient chart shows both SW and HW.]
16
The boss was very happy
[Chart: Optimality/integration (e.g. mW, $) versus Flexibility/modularity (e.g. time-to-market). Points: special hardware, traditional processors + SW, and ASIP; gaps labeled ~10x on both axes.]
Use Software for Control
Use Application-specific datapath for computation
17
And they worked together happily ever after
18
Outline
Using ASIP – a new design paradigm
EEMBC – a case study
Designing ASIPs using Xtensa and TIE
Addressing the needs of platforms
ASIP computing capabilities
ASIP communication capabilities
Challenges
19
What Is “EEMBC”?
EDN Embedded Microprocessor Benchmark Consortium
Pronounced “Embassy”
Non-profit consortium, funded by over 40 members
Including: ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba, Tensilica, and more
Objective: Provide independently certified benchmark scores relevant to deeply embedded processor applications
Independent laboratory recreates and certifies all benchmark results - no tricks
20
EEMBC Benchmark Suites
Five different benchmark suites
• Consumer
• Networking
• Telecom
• Automotive
• Office Automation
Each suite comprises a range (five to sixteen) of benchmarks representative of that product category
Example: Consumer
• Image compression, image filtering, color conversion
21
Two Metrics: Out-of-box vs. Optimized
Out-of-Box
• Benchmark C code, no manual code optimization, no assembly coding
Optimized, or “Full-Fury”
Conventional Processors
• Laboriously hand-tuned assembly code
• Rewriting C code to fit the architecture for VLIW or SIMD machines
• Changing Code to Fit the Processor
Xtensa
• Optimized processor using Xtensa processor generator and TIE Compiler
• Changing Processor to Fit the Application!!
22
Xtensa Optimization Process
Step #1: Configure processor via generator GUI
• Compile C-code, evaluate results
• Modify configuration as needed
• “Out of Box” results measurement taken here
Step #2: Profile Code, Add TIE
Step #3: Optimize Code to Utilize TIE instructions
• “Optimized” results measured on final hardware configuration
Same Path Used by Tensilica Customers!
23
Optimized Xtensa Configurations for EEMBC
OUT-OF-BOX: Configured Xtensa (using GUI click-box options), unmodified C-Code
OPTIMIZED: Configured Xtensa plus TIE gates & instructions, C-Code optimizations
Consumer Configuration
• Out-of-box: 25000 base gates + 37600 config. gates (62.6K), 200MHz
• Optimized: adds 64.1K TIE, 127K total gates, 200MHz
Network Configuration
• Out-of-box: 25000 base gates + 25000 config. gates (50K), 200MHz
• Optimized: adds 9.2K TIE, 59K total gates, 200MHz
Telecom Configuration
• Out-of-box: 25000 base gates + 37000 config. gates, 200MHz
• Optimized: adds VECTRA and 18K TIE, 180K total gates, 200MHz
Illustrations conceptual, see EEMBC report for full details
24
EEMBC Consumer Benchmark
[Bar chart: Consumermark (axis 0 to 200) by processor, with Out-of-box Xtensa and Optimized Xtensa labeled]
25
EEMBC Consumer Benchmark
[Bar chart: Consumermark / MHz (axis 0.00 to 1.00) by processor, with Out-of-box Xtensa and Optimized Xtensa labeled]
26
EEMBC Networking Benchmark
[Bar chart: Netmark (axis 0 to 14) by processor, with Out-of-box Xtensa, Optimized Xtensa, and AMD K6 labeled]
27
EEMBC Networking Benchmark
[Bar chart: Netmark / MHz (axis 0.000 to 0.045) by processor, with Out-of-box Xtensa, Optimized Xtensa, and AMD K6 labeled]
28
EEMBC Telecom Benchmark
[Bar chart: Telemark (axis 0 to 100) by processor, with Out-of-box Xtensa, Optimized Xtensa, and BOPS 2x2 labeled; one off-scale bar is annotated 225.8]
29
EEMBC Telecom Benchmark
[Bar chart: Telemark / MHz (axis 0.000 to 0.500) by processor, with Out-of-box Xtensa, Optimized Xtensa, and BOPS 2x2 labeled; one off-scale bar is annotated 1.67]
30
Outline
Using ASIP – a new design paradigm
EEMBC – a case study
Designing ASIPs using Xtensa and TIE
Addressing the needs of platforms
ASIP computing capabilities
ASIP communication capabilities
Challenges
31
ASIP Generation Flow
Select processor options
Xtensa Processor Generator
[Diagram: configurable blocks ALU, Pipe, I/O, Timer, MMU, Register File, Cache]
Tailored, synthesizable HDL uP core
• Optimizing C/C++ Compiler
• Cycle-accurate Simulator
• Assembler
• Linker
• C/C++/asm/inst Debugger
• RTOS
Describe new instructions
In Minutes!
32
Tensilica Instruction Extension (TIE) Language
opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
assign ACC1 = ACC1 + ars[15:0] * art[15:0];
assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
use ars 1; use art 1;
use ACC1 2; use ACC2 2;
def ACC1 2; def ACC2 2;
}
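To make the PMAC semantics concrete, here is a rough C reference model of what the two `assign` statements compute. It is a sketch under stated assumptions: the 16-bit operand halves are treated as unsigned, the 40-bit states wrap on overflow, and the type `pmac_state` and function `pmac` are hypothetical names for this model, not generated Xtensa code.

```c
#include <stdint.h>

/* ACC1 and ACC2 are declared as 40-bit states in the TIE description. */
#define ACC_MASK ((UINT64_C(1) << 40) - 1)

typedef struct {
    uint64_t acc1;  /* models state ACC1 (low 40 bits used) */
    uint64_t acc2;  /* models state ACC2 (low 40 bits used) */
} pmac_state;

/* One PMAC: two 16x16 multiply-accumulates on the halves of ars and art.
   Assumption: operands are unsigned and the accumulators wrap at 40 bits. */
static void pmac(pmac_state *st, uint32_t ars, uint32_t art)
{
    uint64_t lo = (uint64_t)(ars & 0xFFFFu) * (art & 0xFFFFu);
    uint64_t hi = (uint64_t)(ars >> 16) * (art >> 16);

    st->acc1 = (st->acc1 + lo) & ACC_MASK;  /* ACC1 + ars[15:0] * art[15:0] */
    st->acc2 = (st->acc2 + hi) & ACC_MASK;  /* ACC2 + ars[31:16] * art[31:16] */
}
```

A model like this can serve as a golden reference when checking the instruction in simulation: feed both the same operands and compare accumulator values.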
33
Outline
Using ASIP – a new design paradigm
EEMBC – a case study
Designing ASIP using Xtensa and TIE
Addressing the needs of platforms
ASIP computing capabilities
ASIP communication capabilities
Challenges
34
Sample platforms
Intel IXP1200
Vitesse PRISM IQ2000
Motorola C-Port CDP C-5
PMC-Sierra VoIP Gateway
[Block diagrams of the four platforms, one labeled "Network Processor Architecture". The diagram with readable labels shows a StrongArm Core (166 MHz) with instruction cache, 8KB data cache, and 1KB mini-data cache; six microengines; SDRAM and SRAM controllers; a PCI interface; a hash engine; an IX Bus interface; and scratchpad SRAM, connected by 32-bit and 64-bit buses.]
35
Observations
Heterogeneous processing elements
General purpose processors
Micro-controllers
Dedicated blocks
Heterogeneous communication links
Bandwidth
Latency
Hardware overhead
Communication overhead
36
Two Legs Of Platform Design
Platform Designer
• Processing Element Design
• Communication Design
[Background: the sample platform block diagrams]
37
Outline
Using ASIP – a new design paradigm
EEMBC – a case study
Designing ASIP using Xtensa and TIE
Addressing the needs of platforms
ASIP computing capabilities
ASIP communication capabilities
Challenges
38
ASIP requirements
Match the performance of hard-wired logic
Offer variety of performance/cost points
Easy to design
Easy to use
[Chart: Optimality/integration (e.g. mW, $) versus Flexibility/modularity (e.g. time-to-market), showing special hardware, traditional processors + SW, and ASIP, with ~10x gaps on both axes. Captions: Use Software for Control; Use Application-specific datapath for computation. Background: the sample platform block diagrams.]
39
Fixed Processors Cannot Replace ASIC
[Diagram: Source, RF0, FU0, Result, Decoder, Control]
Temporal bottleneck: limited functionality
Spatial bottleneck: not enough bandwidth
40
Adding Customized Function Units to Break Temporal Bottleneck
[Diagram: Source routing, RF0, FU0 FU1 FU2 FU3, Result routing, Decoder, Control, FSM, Storage]
41
Example of Customized Functional Unit
opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
assign ACC1 = ACC1 + ars[15:0] * art[15:0];
assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
use ars 1; use art 1;
use ACC1 2; use ACC2 2;
def ACC1 2; def ACC2 2;
}
42
Effectiveness of Customized Functional Unit
Requirements:
Performance - similar
Cost - similar
Ease of design – similar
TIE: assign ACC1 = ACC1 + ars[15:0] * art[15:0];
Ease of use – much easier
C: PMAC(x, y);
43
Adding Processor States to Break Spatial Bottleneck
[Diagram: Source routing, RF0, S0 S1, FU0 FU1 FU2 FU3, Result routing, Decoder, Control, FSM, Storage]
44
Example of Processor States
opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
assign ACC1 = ACC1 + ars[15:0] * art[15:0];
assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
use ars 1; use art 1;
use ACC1 2; use ACC2 2;
def ACC1 2; def ACC2 2;
}
45
Effectiveness of Processor States
Requirements:
Performance – better
Especially when used with pipelined functional units
Cost – higher due to pipelined implementation
Ease of design – very simple
state ACC1 40
Ease of use – very easy
PMAC(x, y); /* implicitly using the states */
x = R_ACC1_Lo(); W_ACC1_Hi(y);
46
Sharing States Using Register Files
[Diagram: Source routing, RF0 RF1 RF2, S0 S1, FU0 FU1 FU2 FU3, Result routing, Decoder, Control, FSM, Storage]
47
Example of a Register File
regfile RF24 24 16 r
operand vs s {RF24[s]}
operand vt t {RF24[t]}
operand vr r {RF24[r]}
iclass rrr {average} {out vr, in vs, in vt}
reference average {
wire [8:0] t2 = vs[23:16] + vt[23:16];
wire [8:0] t1 = vs[15:8] + vt[15:8];
wire [8:0] t0 = vs[7:0] + vt[7:0];
assign vr = {t2[8:1], t1[8:1], t0[8:1]};
}
ctype rgb 24 32 RF24
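As a cross-check of the `average` reference semantics, the same computation can be modeled in plain C. This is a sketch, assuming the 24-bit value sits in the low bits of a 32-bit word; the typedef `rgb24` is a hypothetical stand-in for the `rgb` ctype, and the function mirrors the three 9-bit sums and the `[8:1]` bit selects.

```c
#include <stdint.h>

/* Hypothetical stand-in for the TIE ctype rgb: 24 bits in a 32-bit word. */
typedef uint32_t rgb24;

/* Per-channel average of two packed 24-bit values.
   Each 8-bit channel pair is summed into 9 bits, then bits [8:1] are kept,
   which is the same as dividing the sum by two and truncating. */
static rgb24 average(rgb24 vs, rgb24 vt)
{
    uint32_t t2 = ((vs >> 16) & 0xFFu) + ((vt >> 16) & 0xFFu);
    uint32_t t1 = ((vs >> 8) & 0xFFu) + ((vt >> 8) & 0xFFu);
    uint32_t t0 = (vs & 0xFFu) + (vt & 0xFFu);

    return ((t2 >> 1) << 16) | ((t1 >> 1) << 8) | (t0 >> 1);
}
```

In software this takes several shifts, masks, and adds per pixel; the TIE version does all three channels in one instruction because the 9-bit adders run side by side.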
48
Crossing the HW/SW Boundary
Working with typed data:
rgb x, y, z; /* C code */
Letting C-Compiler allocate the registers
z = average(x, y); /* assembly: average v1, v4, v6 */
Letting C-Compiler spill the registers
Letting C-Compiler convert to/from other types
yuv a, b;
b = average (a, y);
Auto saved/restored on context switching
49
Effectiveness of Register File
Requirements:
Performance – better
Especially when used with pipelined functional units
Cost – higher due to pipelined implementation
Ease of design – very simple
regfile RF24 24 16 r
Ease of use – very easy
rgb x, y, z;
z = average(x, y);
50
Multi-cycle Instructions
[Diagram: Source routing, RF0 RF1 RF2, S0 S1, FU0 FU1 FU2 FU3, Result routing, Decoder, Control, FSM, Storage]
51
Example of a Multi-cycle Instruction
opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
assign ACC1 = ACC1 + ars[15:0] * art[15:0];
assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
use ars 1; use art 1;
use ACC1 2; use ACC2 2;
def ACC1 2; def ACC2 2;
}
[Diagram labels: ars, art, ACC1, ACC2]
52
Effectiveness of Multi-cycle Instructions
Requirements:
Performance – usually better
difficult in hard-wired logic
Cost – higher due to bypass and interlock logic
Ease of design – very simple
use arr 3;
Ease of use – very easy and optimized by C Compiler
C:
t = sat_mult(x, y);
z = sat_add(z, t);
t2 = sat_mult(x2, y2);
Scheduled assembly:
sat_mult s3, s1, s2
sat_mult s6, s5, s4
sat_add s7, s7, s3
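The slide does not define `sat_mult` and `sat_add`, so the following is only a plausible software model, assuming 32-bit signed operands that clamp to the representable range instead of wrapping. The helper `clamp32` is a hypothetical name introduced for this sketch.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate result to the 32-bit signed range. */
static int32_t clamp32(int64_t v)
{
    if (v > INT32_MAX)
        return INT32_MAX;
    if (v < INT32_MIN)
        return INT32_MIN;
    return (int32_t)v;
}

/* Assumed semantics: saturating 32-bit signed addition. */
static int32_t sat_add(int32_t a, int32_t b)
{
    return clamp32((int64_t)a + b);
}

/* Assumed semantics: saturating 32-bit signed multiplication. */
static int32_t sat_mult(int32_t a, int32_t b)
{
    return clamp32((int64_t)a * b);
}
```

In the scheduled assembly, the second `sat_mult` does not depend on the first one's result, so the compiler can issue it while the first multiply is still in flight and place the dependent `sat_add` afterwards. This is the kind of reordering the "optimized by C Compiler" remark refers to.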
53
Replacing the State Machine
[Diagram: Source routing, RF0 RF1 RF2, S0 S1, FU0 FU1 FU2 FU3, Result routing, Decoder, Control, FSM, Storage, program]
54
Effectiveness of Control Programming
Requirements:
Performance – comparable
0-overhead loop, branch prediction, scheduling
Cost – comparable
Ease of design – very simple
reference BT {…, assign BranchTarget = …; …}
Ease of use – very easy
while
for
if then else
switch
goto
function call
55
Short Summary of ASIP Computing Capability
ASIP:
Performance – comparable
Cost – higher due to pipelined implementation
Ease of design – easy using Xtensa/TIE
Ease of use – very easy using optimizing compiler
56
Meet the Communication Requirements
Platform Designer
• Processing Element Design
• Communication Design
[Background: the sample platform block diagrams]
57
Ways for ASIP to Communicate
[Diagram: ASIP internals (Functional Units, Regfiles, State, Load/Store Units, I-RAM, D-RAM, I-Cache, D-Cache, Processor Interface (PIF), External Interface, Interrupt) connected to external MEM, Device, and ASIP blocks]
58
Communicate Via PIF and Shared Memory
[Diagram: ASIP internals (Functional Units, Regfiles, States, Load/Store Unit, I-RAM, D-RAM, I-Cache, D-Cache, Processor Interface (PIF), External Interface, Interrupt) connected to external MEM, Device, and ASIP blocks]
Pros:
• Simple
• Low cost
• Standard
Cons:
• Long latency
• Limited by PIF width
• Resource contention
• Polling
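The polling drawback of shared-memory communication can be illustrated with a one-slot mailbox that two processors would both map. This is a minimal sketch under stated assumptions: a single producer and a single consumer, memory that both sides see coherently, and C11 atomics standing in for whatever synchronization the real platform provides. The type `mailbox` and the functions `mailbox_try_send` and `mailbox_try_receive` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* One-slot mailbox placed in memory visible to both processors (assumed). */
typedef struct {
    atomic_bool full;   /* true while payload holds an unread value */
    uint32_t payload;
} mailbox;

/* Producer side: returns false if the previous value is still unread. */
static bool mailbox_try_send(mailbox *mb, uint32_t value)
{
    if (atomic_load_explicit(&mb->full, memory_order_acquire))
        return false;
    mb->payload = value;
    atomic_store_explicit(&mb->full, true, memory_order_release);
    return true;
}

/* Consumer side: returns false if nothing has arrived yet. */
static bool mailbox_try_receive(mailbox *mb, uint32_t *out)
{
    if (!atomic_load_explicit(&mb->full, memory_order_acquire))
        return false;
    *out = mb->payload;
    atomic_store_explicit(&mb->full, false, memory_order_release);
    return true;
}
```

A consumer has to call `mailbox_try_receive` repeatedly until it succeeds, and every one of those reads crosses the shared interface. That repeated traffic is the latency and contention cost listed under Cons.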
59
Communicate Via Interrupts
[Diagram: the ASIP block diagram from the "Ways for ASIP to Communicate" slide, with external MEM, Device, and ASIP blocks]
Pros:
• Simple
• Low cost
• Standard
• Event driven
Cons:
• Very low bandwidth
60
Communicate Via Dual-ported Local Memory
[Diagram: the ASIP block diagram from the "Ways for ASIP to Communicate" slide, with external MEM, Device, and ASIP blocks]
Pros:
• Fast
Cons:
• High cost
• Special programming
• Limited bandwidth
61
Communicate Via Local Memory Port
[Diagram: the ASIP block diagram from the "Ways for ASIP to Communicate" slide, with external MEM, Device, and ASIP blocks]
Pros:
• Configurable
• Low latency
• Low cost
Cons:
• Non-standard
• Limited bandwidth
• Special programming
• External HW design
• Exposed to ASIP pipeline
62
Communicate Via Processor States
[Diagram: the ASIP block diagram from the "Ways for ASIP to Communicate" slide, with external MEM, Device, and ASIP blocks]
Pros:
• Highly configurable
• Low latency
• Low cost
• High bandwidth
Cons:
• Non-standard
• Special programming
• One-way
• Restricted to level signal
• External HW design
63
Communicate Via Instructions
[Diagram: the ASIP block diagram from the "Ways for ASIP to Communicate" slide, with external MEM, Device, and ASIP blocks]
Pros:
• Highly configurable
• No latency
• Very low cost
• High bandwidth
Cons:
• Non-standard
• Special programming
• Restricted to edge signal
• External HW design
• Exposed to ASIP pipeline
64
Outline
Using ASIP – a new design paradigm
EEMBC – a case study
Designing ASIP using Xtensa and TIE
Addressing the needs of platforms
ASIP computing capabilities
ASIP communication capabilities
Challenges
65
ASIP Challenges
[Chart: Optimality/integration (e.g. mW, $) versus Flexibility/modularity (e.g. time-to-market), showing special hardware, traditional processors + SW, and ASIP, with ~10x gaps on both axes. Captions: Use Software for Control; Use Application-specific datapath for computation.]
Balance computation and communication
Performance, cost, power
Choose the right instructions
Flexibility, product longevity, ease of programming
Let HW engineers design ASIP
No FSMs!
Let SW engineers design ASIP
Efficient functional units!
Support variety of communication
Separation of platform designs and system designs