Philips Research
ECLIPSE
Extended CPU Local Irregular Processing Structure

IST: E. van Utteren
IPA: W.J. Lippmann
PROMMPT: J.T.J. v. Eijndhoven
DD&T: C. Niessen
ESAS: A. van der Werf
ViPs: G. Depovere
IT: E. Dijkstra
DS & PC: A. van Gorkum
IC Design: G. Beenker
AV & MS: Th. Brouste
LEP, HVE: T. Doyle

[Title-slide figure: ECLIPSE CPU]
DVP: design problem
Nexperia media processors
DVP: application domain
• High volume consumer electronics products: future TV, home theatre, set-top box, etc.
• Media processing: audio, video, graphics, communication
DVP: SoC platform
• Nexperia line of media processors for mid- to high-end consumer media processing systems is based on DVP
• DVP provides template for System-on-a-Chip
• DVP supports families of evolving products
• DVP is part of corporate HVE strategy
DVP: system requirements
• High degree of flexibility, extendability and scalability
  – unknown applications
  – new standards
  – new hardware blocks
• High level of media processing power
  – hardware coprocessor support
DVP: architecture philosophy
• High degree of flexibility is achieved by supporting media processing in software
• High performance is achieved by providing specialized hardware coprocessors
• Problem: How to mix & match hardware based and software based media processing?
DVP: model of computation
[Figure: Kahn process network — processes A, B, C connected by FIFOs; each process reads, executes, writes]

Model of computation is Kahn Process Networks:
• The Kahn model allows 'plug and play':
  • Parallel execution of many tasks
  • Configures different applications by instantiating and connecting tasks
  • Maintains functional correctness independent of task scheduling issues
• TSSA: API to transform C programs into Kahn models
DVP: model of computation
[Figure: application task graph mapped onto CPU, coproc1, coproc2]

Application: parallel tasks, streams
Mapping: static
Architecture: programmable graph
DVP: architecture philosophy
• Kahn processes (nodes) are mapped onto (co)processors
• Communication channels (graph edges) are mapped onto buffers in centralized memory
• Scheduling and synchronization (notification & handling of empty or full buffers) is performed by control software
• Communication pattern between modules (data flow graph) is freely programmable
DVP: generic architecture
• Shared, single address space, memory model
  • Flexible access
  • Transparent programming model
• Physically centralized random access memory
  • Flexible buffer allocation
  • Fits well with stream processing
• Single memory-bus for communication
  • Simple and cost effective
DVP: example architecture instantiation
[Block diagram: VLIW CPU (I$, D$) and MIPS CPU (I$, D$) with an image scaler, video-in, video-out, audio-in, audio-out, serial I/O, timers, I2C I/O, and a PCI bridge, sharing SDRAM over a single bus]
DVP: TSSA abstraction layer
[Layer diagram: TSSA-Appl1 and TSSA-Appl2 on top of the TSSA-OS layer; below it, TM-CPU software and traditional coarse-grain TM co-processors; TSSA stream data is buffered in off-chip SDRAM, with synchronization via CPU interrupts]
DVP: TSSA abstraction layer
• Hides implementation details:
  • graph setup
  • buffer synchronization
• Runs on pSOS (and other RTKs)
• Provides standard API
• Defines standard data formats
Outline
• DVP
• Eclipse DVP subsystem
• Eclipse architecture
• Eclipse application programming
• Simulator
• Status
Eclipse DVP subsystem
Objective
Increase flexibility of DVP systems, while maintaining cost-performance.

Customer
• Semiconductors: Consumer Systems (Transfer to TTI)
• Consumer Electronics: Domain 2 (BG-TV Brugge)
• Research

Products
Mid- to high-end DVP / TSSA systems: DTVs and STBs
Eclipse DVP subsystem: design problem
• Increase application flexibility through re-use of medium-grain function blocks, in HW and SW
• Keep streaming data on-chip

But?
• More bandwidth visible
• Limited memory size
• High synchronization rate
• CPU unfriendly

[Figure: DVP/TSSA system — MPEG, CPU, HDVO, and condor blocks communicating through off-chip SDRAM]

DVP/TSSA system:
• Coarse-grain 'solid' function blocks (reuse, HW/SW?)
• Stream data buffered in off-chip memory (bandwidth, power?)
Design problem: new DVP subsystem
[Figure: system with video out, MPEG2 decode, MPEG2 encode, DVD decode, 1394, two CPUs, external memory, and the new Eclipse subsystem]
Eclipse DVP subsystem: application domain
Now, target for 1st instance:
• Dual MPEG2 full HD decode (1920 x 1080 @ 60i)
• MPEG2 SD transcoding and HD decoding
Anticipate:
• Range of formats (DV, MJPEG, MPEG4)
• 3D-graphics acceleration
• Motion-compensated video processing
Application domain: MPEG2 decoding (HD)
[Block diagram: MPEG2 HD decoding pipeline — variable length decoding → run length decoding → zig-zag scan → inverse quantization → inverse DCT → adder → HD video, with motion compensation fed from reference pictures. Stream bandwidths: MPEG2 HD bitstream < 10 MB/s; coefficient streams 141, 106, 94, and 8 MB/s; reference picture traffic > 221 MB/s and < 407 MB/s ('saturate'); HD video output 141 MB/s]
Application domain: MPEG2 encoding (SD)
[Block diagram: MPEG2 SD encoding pipeline — SD video → picture re-order → subtractor → DCT + quantization → zig-zag scan → run length encoding → variable length encoding → SD MPEG2 bitstream, with a reconstruction loop (inverse quantization → inverse DCT → adder → reference pictures → motion compensation) and motion estimation producing motion vectors. Stream bandwidths: pixel streams 19–28 MB/s; re-ordered pictures 21 MB/s; bitstream < 1.9 MB/s; motion vectors 1.6 MB/s; motion estimation reference traffic N×28 – N×53 MB/s (12–25 MB/s, 44–81 MB/s)]
Application domain: MPEG-4 video decoding
[Block diagram: MPEG-4 video decoding — MPEG-4 ES → variable length decoding → inverse scan → DC & AC prediction → inverse quantization → IDCT → picture reconstruction, with motion compensation from reference pictures and an MV decoder, plus a shape path (context arithmetic decoding, shape MV prediction, shape motion compensation). Stream rates (MB/s): bitstream < 7; coefficient streams 90 (several), 128, < 220, < 384; reference picture traffic 800; motion vectors 0.1]
MPEG-4: system level application partitioning

[Figure: network layer → de-multiplex → decompression — video object on Eclipse, 3D Gfx object and audio object on the CPU, plus scene description — → composition and rendering on Sandra]
MPEG-4: partitioning Eclipse - SANDRA

[Block diagram: Eclipse subsystem (SRAM with VLD, DCT, MC, VI, and MBS coprocessors) and media CPU (I$, D$) connected through the MMI to SDRAM; video out handled by SANDRA]
Eclipse DVP subsystem: current TSSA style
[Layer diagram: TSSA-Appl1 and TSSA-Appl2 on the TSSA layer; below it, TM-CPU software and traditional coarse-grain TM co-processors; TSSA stream data is buffered in off-chip SDRAM, with synchronization via CPU interrupts]
Eclipse DVP subsystem: Eclipse tasks embedded in TSSA
[Figure: TSSA-Appl1 and TSSA-Appl2 on the TSSA layer, with an Eclipse driver. Legend: Eclipse task on HW; Eclipse task in SW; Eclipse data stream via on-chip memory; TSSA task on Eclipse; TSSA task in SW; TSSA task on DVP HW; TSSA data stream via off-chip memory]
Eclipse DVP subsystem: scale down
Hierarchy in the DVP system:
• Computational model which fits neatly inside DVP & TSSA
Scale down from SoC to subsystem:
• Limited internal distances
• High data bandwidth and local storage
• Fast inter-task synchronization
Outline
• DVP
• Eclipse DVP subsystem
• Eclipse architecture
  • Model of computation
  • Generic architecture
• Eclipse application programming
• Simulator
• Status
Eclipse architecture: model of computation
[Figure: application task graph mapped onto CPU, coproc1, coproc2]

Application: parallel tasks, streams
Mapping: static
Architecture: programmable, medium grain, multitasking
Model of computation: architecture philosophy
The Kahn model allows ‘plug and play’:
• Parallel execution of many tasks
• Application configuration by instantiating and connecting tasks.
• Functional correctness independent of task scheduling issues.
Eclipse is designed to accomplish this with:
• A mixture of HW and SW tasks.
• High data rates (GB/s) and medium buffer sizes (KB).
• Re-use of co-processors over applications through multi-tasking
• Runtime application reconfiguration.
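As a minimal illustration of the Kahn properties listed above — tasks communicating only through FIFOs, with output independent of scheduling — here is a hypothetical sketch (names and sizes are illustrative, not the Eclipse implementation):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Bounded FIFO with blocking read/write: the only communication channel
// between tasks, as in a Kahn process network.
class Fifo {
  std::queue<int> q;
  std::mutex m;
  std::condition_variable cv;
  static const size_t capacity = 4;
public:
  void write(int v) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return q.size() < capacity; });
    q.push(v);
    cv.notify_all();
  }
  int read() {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return !q.empty(); });
    int v = q.front();
    q.pop();
    cv.notify_all();
    return v;
  }
};

// Producer and consumer tasks connected only by the FIFO: the consumer's
// output stream is the same no matter how the threads are scheduled.
std::vector<int> run_network() {
  Fifo f;
  std::vector<int> out;
  std::thread producer([&] { for (int i = 0; i < 8; i++) f.write(i); });
  std::thread consumer([&] { for (int i = 0; i < 8; i++) out.push_back(2 * f.read()); });
  producer.join();
  consumer.join();
  return out;
}
```

Blocking reads on empty FIFOs (and, with bounded buffers, blocking writes on full ones) are what make the result scheduling-independent.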
Allow proper balance in HW/SW combination
[Chart: energy efficiency vs. application flexibility of given silicon — function-specific engines offer high efficiency at low flexibility, a DSP-CPU offers high flexibility at low efficiency; Eclipse is positioned between the two]
Previous Kahn style architectures in PRLE
CPA:
• Data driven, HW synchronization, multitasking coprocs
• But? Dynamic applications, CPU in media processing

C-Heap:
• Explicit synchronization, shared memory model, mixed HW/SW
• But? High performance, variable packet sizes

Eclipse builds on both.
Outline
• DVP
• Eclipse DVP subsystem
• Eclipse architecture
  • Model of computation
  • Generic architecture
    • Coprocessor shell interface
    • Shell communication interface
    • Architecture instantiation
• Eclipse application programming
• Simulator
• Status
Generic architecture: inter-processor communication
• On-chip, dedicated network for inter-processor communication:
  • Medium grain functions
  • High bandwidth (up to several GB/s)
  • Keep data transport on-chip
• Use DVP-bus for off-chip communication only
Generic architecture: communication network
[Figure: CPU and coprocessors attached to the communication network]
Generic architecture: memory
• Shared, single address space, memory model
  • Flexible access
  • Software programming model
• Centralized wide memory
  • Flexible buffer allocation
  • Fits well with stream processing
• Single wide memory-bus for communication
  • Simple and cost effective
Generic architecture: shared on-chip memory
[Figure: CPU and coprocessors attached to the communication network and the shared memory]
Generic architecture: task level interface
Partition functionality between application-dependent core and generic support. Introduce the (co-)processor shell:
• Shell is responsible for application configuration, task scheduling, data transport and synchronization
• Shell (parameterized) micro-architecture is re-used for each coprocessor instance
• Allow future updates of communication network while re-using (co-)processor core design
• Implementations in HW or SW
Generic architecture: layering

[Layer diagram: computation layer (CPU and coprocessors) — task-level interface — generic support layer (HW and SW shells) — communication interface — communication network layer (communication network and memory)]
Task level interface: five primitives
Multitasking, synchronization, and data transport:
• int GetTask( location, blocked, error, &task_info )
• bool GetSpace( port_id, n_bytes )
• Read( port_id, offset, n_bytes, &byte_vector )
• Write( port_id, offset, n_bytes, &byte_vector )
• PutSpace( port_id, n_bytes )

GetSpace is used for both get_data and get_room calls.
PutSpace is used for both put_data and put_room calls.
The processor has the initiative, the shell answers.
Task level interface: port IO
[Figure: port I/O on a 'data tape' —
a: initial situation of the tape with current access point;
b: inquiry (GetSpace) provides a window of n_bytes1 on requested space;
c: Read/Write actions on contents, addressed by offset;
d: commit (PutSpace) moves the access point ahead by n_bytes2]
Task level interface: communication through streams
Kahn model: [Figure: Task A → Task B]

Implementation with shared circular buffer:
[Figure: circular buffer holding space filled with data and empty space, with a granted window A for the writer and a granted window B for the reader]

The shell takes care that the access windows have no overlap.
Task level interface: multicast
Forked streams:
[Figure: Task A writes one stream read by both Task B and Task C; the circular buffer holds space filled with data and empty space, with a granted window for writer A and separate granted windows for readers B and C]

The task implementations are fixed (HW or SW).
Application configuration is a shell responsibility.
Task level interface: characteristics
• Linear (fifo) synchronization order is enforced
• Random access read/write inside acquired window through offset argument
• Shells operate on unformatted sequences of bytes; any semantic interpretation is left to the processor
• A task is not aware of where its streams connect to, or of other tasks sharing the same processor
• The shell maintains the application graph structure
• The shell takes care of: fifo size, fifo memory location, wrap-around addressing, caching, cache coherency, bus alignment
Task level interface: multi-tasking
int GetTask( location, blocked, error, &task_info )

• Non-preemptive task scheduling
• Coprocessor provides explicit task-switch moments
• Task switches separate 'processing steps' (granularity: tens or hundreds of clock cycles)
• Shell is responsible for task selection and administration
• Coprocessor provides feedback to the shell on task progress
Generic architecture: generic support

[Layer diagram: computation layer (CPU and coprocessors) — task-level interface — generic support layer (HW and SW shells) — communication interface — communication network layer (communication network and memory)]
Generic support: the Shell
The shell takes care of:
• The application graph structure, supporting run-time reconfiguration
• The local memory map and data transport (fifo size, fifo memory location, wrap-around addressing, caching, cache coherency, bus alignment)
• Task scheduling and synchronization

The distributed implementation:
• Allows fast interaction with local coprocessor
• Creates a scalable solution
Generic support: synchronization
[Figure: coprocessor A's shell executes PutSpace( port, n ): it updates its local counter (space -= n) and sends a message putspace( gsid, n ) over the communication network to coprocessor B's shell, which updates its own counter (space += n). A GetSpace( port, m ) at B succeeds when m ≤ space]

• PutSpace and GetSpace return after local update or inquiry.
• Delay in messaging does not affect functional correctness.
Generic support: application configuration
[Figure: each shell holds a stream table (indexed by stream_id: addr, size, space, gsid, ...) and a task table (indexed by task_id: info, budget, ...), attached to the coprocessor and the communication network]

Shell tables are accessible through a PI-bus interface.
Generic support: data transport caching
• Translate byte-oriented coprocessor interface to wide and aligned bus transfers
• Separate caches for read and write
• Direct mapped: two adjacent words per port
• Coherency is enforced as a side-effect of GetSpace and PutSpace
• Support automatic prefetching and preflushing
Generic support: cache coherency
[Figure: cache coherency cases for a read request, shown against memory transfer units (words), the GetSpace window, and the available space known by the shell —
a: read fetches word entirely inside granted window;
b: read fetches word which extends outside the window, but inside known available space;
c: read fetches word which extends into dirty space]
Generic support: task scheduling
A simple task scheduler runs locally in each shell:
• Observes empty/full states of fifos and task blocking
• Round-Robin selection of ‘runnable’ tasks
• Parameterized ‘compute resource’ budgets per task
• Temporary disabling of tasks for reconfiguration at specified locations in the data stream
Task scheduling: computation budget
• Computation budget = maximum number of time slices allowed per task selection
  – Relative budget value controls compute resource partitioning over tasks
  – Absolute budget value controls task switch frequency, influencing overhead of state save & restore
• Running budget is set to the computation budget each time the task is selected in round-robin order
• The running budget is decremented with a fixed clock period, once every time slice
Task scheduling algorithm
[Flowchart: on GetTask — if RunningBudget > 0 and the current task is runnable, return TaskId; otherwise repeat TaskId++ mod NrTasks until Runnable[TaskId], then set RunningBudget = Budget[TaskId] and return TaskId. A clock event decrements RunningBudget]
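The flowchart above, combined with the runnable criterion from the next slides, can be sketched in software (hypothetical code; field names are illustrative):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Round-robin task selection with per-task computation budgets.
// Assumes at least one task is always runnable (the real shell would
// otherwise report 'no task' and let the coprocessor idle).
class TaskScheduler {
public:
  struct Task {
    bool enable = true;    // task configured active
    bool blocked = false;  // last GetSpace inquiry failed
    bool schedule = true;  // if false, task may run even with space == 0
    size_t space = 0;      // available data/room on its streams
    int budget = 1;        // computation budget, in time slices
  };
  explicit TaskScheduler(std::vector<Task> t) : tasks(std::move(t)) {}

  bool Runnable(int id) const {
    const Task& t = tasks[id];
    return t.enable && !t.blocked && (t.space > 0 || !t.schedule);
  }
  // One time slice elapses: the running budget drains.
  void ClockEvent() { if (runningBudget > 0) runningBudget--; }
  // Answer a GetTask inquiry from the coprocessor.
  int GetTask() {
    if (runningBudget > 0 && Runnable(taskId)) return taskId;  // continue
    do {
      taskId = (taskId + 1) % static_cast<int>(tasks.size());
    } while (!Runnable(taskId));
    runningBudget = tasks[taskId].budget;  // re-arm the budget
    return taskId;
  }
private:
  std::vector<Task> tasks;
  int taskId = 0;
  int runningBudget = 0;
};
```

A relative increase of one task's budget shifts compute resources toward it; a larger absolute budget lowers the task-switch frequency and hence state save/restore overhead.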
Task scheduling algorithm: dynamic workload
Shell does not interpret media data but performs a best guess:
• Space: the amount of available data/room in the stream buffer
• Blocked flag: true if insufficient space on the last inquiry
• Schedule flag: if false, a task may be selected even when Space = 0 (data dependent stream selection)
• Task Enable flag: true if the task is configured to be active
Task scheduling algorithm: Runnable criterion
[Figure: per-task Runnable signal combined from the task's TaskEnable flag and the Schedule, Space, and Blocked fields of its streams:]

  Runnable = TaskEnable & !Blocked & (Space > 0 | !Schedule)

  Blocked := !GetSpace( n_bytes )
  Blocked := false when an external PutSpace increases Space
Task scheduling: parallel implementation
Task selection background process:
1. For each task, check if it is runnable, based on available space in the stream buffers
2. Select a new task from the list of runnable tasks, round-robin

Provide an immediate answer to a GetTask inquiry:
– Continue current task if its computation budget is not depleted
– Otherwise, start pre-selected next task.

Selection of next task may lag behind on buffer status:
– Only the active task decreases space in the stream buffer
– All incoming PutSpace messages increase space in the buffer
Task scheduler implementation
[Block diagram: task scheduler inside the shell — the stream table (Space, Blocked, Schedule, TaskId, ...) feeds a Runnable? check; the task table (Enable, Runnable, Budget, ...) feeds selection of the NextTask; the active task's RunningBudget is decremented on clock events; GetTask, GetSpace, and PutSpace connect the coprocessor to these tables]
Generic support: internal view
[Figure: shell internal view — DTW, DTR, SS, TS, and Sync blocks between the coprocessor and the communication network]
Generic architecture: communication network

[Layer diagram: computation layer (CPU and coprocessors) — task-level interface — generic support layer (HW and SW shells) — communication interface — communication network layer (communication network and memory)]
Communication network: characteristics
• Synchronization messages are passed through a token ring, allowing one message per clock cycle
• Fifos are mapped in a shared on-chip memory, allowing flexible application configuration
• Data transport is implemented with a wide data bus:
  • DTL based bus protocol
  • Separately arbitrated busses for read and write
  • Independently pipelined for efficient single-word transfers
• All communication paths are uni-directional and pipelined, allowing the insertion of clock-domain bridges
Communication network
[Figure: communication network — shells connected by a token ring for synchronization and by a dual DTL bus, through an arbiter, to the SRAM]
Communication network: clock domains
• VLIW CPU wants low and fixed latency for memory access.
• CPU and memory can run at high clock rate.
• Synthesized coprocessors and long bus must run at lower clock rate.
Example Eclipse instantiation
[Block diagram: example Eclipse instantiation — local memory (32 Kbyte, 128-bit words) behind an arbiter on a 2 × 128-bit @ 150 MHz local bus; coprocessor/shell interfaces at 128 bits @ 300 MHz, 64 bits @ 150 MHz, and 32 bits @ 75 MHz; a CPU64 with I$ and D$ and its own shell; DVP hub and PI bridge connecting to the DVP bus and the PI bus; split into 300 MHz and 150 MHz clock domains]
Outline
• DVP
• Eclipse DVP subsystem
• Eclipse architecture
• Eclipse application programming
  • Coprocessor definition
  • System software
• Simulator
• Status
Coprocessor definition: starting point
[Figure: Kahn process network — processes A, B, C connected by FIFOs; each process reads, executes, writes]

• Model of computation: Kahn Process Networks
• YAPI: simple API to transform C programs into Kahn models
• Expose parallelism and communication
• Decisions on grain sizes for processes and data
• Adopted by various groups in Philips for application modeling
Coprocessor definition: process

[Figure: application C code on generic YAPI is refined to Eclipse-tailored YAPI; each process is split into a control part and a function part, which map onto a coprocessor]
Coprocessor definition: control
• Define processing steps by inserting GetTask, breaking up process iterations.
• Choose explicit synchronization moments.
• Implement state saving around GetTask calls.
• Discern different data types that share a stream.
• Discern different functions to handle the data.
Coprocessor definition: packets
Packets wrap data; packet headers indicate data type:

[Figure: packet layout — byte 0 holds the Type, byte 1 holds NBytes, followed by the payload bytes]
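A sketch of this two-byte header (illustrative encoding; the exact field widths on the real streams are an assumption here):

```cpp
#include <cstdint>
#include <vector>

// Build a packet: byte 0 = type, byte 1 = payload length (NBytes),
// followed by the payload itself.
std::vector<uint8_t> MakePacket(uint8_t type, const std::vector<uint8_t>& payload) {
  std::vector<uint8_t> pkt;
  pkt.push_back(type);
  pkt.push_back(static_cast<uint8_t>(payload.size()));  // NBytes
  pkt.insert(pkt.end(), payload.begin(), payload.end());
  return pkt;
}

uint8_t PacketType(const std::vector<uint8_t>& pkt) { return pkt[0]; }
uint8_t PacketNBytes(const std::vector<uint8_t>& pkt) { return pkt[1]; }
```

This matches the coprocessor example below, which first acquires and reads the 2 header bytes, then acquires 2 + size bytes to consume the payload.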
Coprocessor definition: location packets
Packets of type 'location':
• Payload holds unique identifier denoting location in the stream
• Used for application reconfiguration at specified points in the data processing
• All tasks forward location packets to output streams
• Location identifiers are passed to the shell via GetTask
• The shell compares a location identifier with its corresponding field in the task table. When these match:
  • The task is disabled
  • The shell sends an interrupt to the CPU
• Location identifiers also serve as debug breakpoints
Coprocessor definition: example
  while( true )
  {
    tid = GetTask( location, blocked, error, &task_info );
    if (!tid) return;
    blocked = !GetSpace( IN, 2 ) || !GetSpace( OUT, 2 );
    if (blocked) return;

    // handle location packets
    Read( IN, 0, 2, &packet );
    if (IsLocation( packet ))
    {
      location = PayLoad( packet );
      Write( OUT, 0, 2, packet );
      PutSpace( IN, 2 );
      PutSpace( OUT, 2 );
      return;
    }

    // handle real data
    ...
Coprocessor definition: example
    // handle real data
    size = NBytes( packet );
    blocked = !GetSpace( IN, 2 + size ) || !GetSpace( OUT, 2 + OUTSIZE );
    if (blocked) return;

    Read( IN, 2, size, &in_data );
    PutSpace( IN, 2 + size );

    error = Compute( task_info, in_data, &out_data );

    Write( OUT, 0, 2 + OUTSIZE, Packet( TYPE, OUTSIZE, out_data ) );
    PutSpace( OUT, 2 + OUTSIZE );
  }
System software
Different types of software:
• Media processing software kernels: TM-CPU software with media operations and communication/synchronization primitives.
• Runtime support: task scheduler, Quality-of-service control.
• System re-configuration: network programming, memory management.
Outline
• DVP
• Eclipse DVP subsystem
• Eclipse architecture
• Eclipse application programming
• Simulator
  • Software architecture
  • Retargetability
  • Flexibility
  • Performance metrics
• Status
Simulation objective
• Verification and validation of the Eclipse architecture
• Architecture design space exploration
• Application development platform
• Starting point for hardware development
• Collaboration with LEP (Sandra)
• Transfer to PS-DVI (Dr. Evil)
Simulator toolchain
[Figure: simulator toolchain — an application setup script (Create Vld, Create Dct, Create Mc) and an architecture setup file, e.g.:

  Dct{
    NTasks: 2
    Shell{
      NStreams: 2
      Dtr.NPorts : 1
    }
  }

drive a simulation run in a chosen simulation mode:

  eclipse_sim -d2 -c1000 -l1 -DTHREADLEVEL=2

producing performance metrics, wave forms, and debug traces:]

   7: [Eclipse.Input.Shell.Ts.Computation] CoprocGetTask: location_id=0x0 blocked=0
   8: [Eclipse.Input.Coproc.Computation] GetTask: location_id=0x0 blocked=0 new task_id=1 task_info=0
   8: [Eclipse.Input.Coproc.Computation] GetSpace: port_id=0 size=130
  10: [Eclipse.Input.Coproc.Computation] Write: port_id=0 size=4 offset=0 data=0x457f801f
  11: [Eclipse.Input.Shell.Dtw.Computation] CoprocWrite: size=4 offset=0 data=0x457f801f
  12: [Eclipse.Input.Coproc.Computation] Write: port_id=0 size=4 offset=4 data=0x0201464c
  13: [Eclipse.Input.Shell.Dtw.Computation] CoprocWrite: size=4 offset=4 data=0x0201464c
  13: [Eclipse.Output.Shell.Ts.Computation] CoprocGetTask: location_id=0x0 blocked=0
  14: [Eclipse.Input.Coproc.Computation] Write: port_id=0 size=4 offset=8 data=0x00000000
Simulator flexibility: simulation modes
Modes of execution:
• Sequential execution — application development with functional verification
• Timed execution — system level performance analysis
• TSS execution — hardware development

All execution modes are implemented in one code base.
Only the interfaces differentiate between these modes.
Simulator: modeled hardware architecture
[Figure: modeled hardware architecture — the coprocessor issues Read/Write to the shell's Dtr/Dtw, GetSpace/PutSpace to Ss, and GetTask to Ts; Dtr and Dtw connect to the transport network, and Sync to the sync network]
Simulator software architecture
[Figure: simulator software architecture — the same coprocessor/shell structure, with explicit master (m) and slave (s) interface (IF) objects on every connection between coprocessor, shell blocks, transport network, and sync network]
Simulator software architecture: shell
[Figure: shell software architecture — Dtr, Dtw, Ss, Ts, and Sync leaf components with their master/slave interface objects]
Simulator components

[Class diagram: Component (Setup(), Init(), MicroscopeRead(), MicroscopeWrite(), Run()) specializes into CompositeComponent and LeafComponent in a composite pattern (1..*); Eclipse and ShellClient are composites; Coproc, Dtw, Dtr, Ss, Sync, Ts, and Transport are leaf components; Vld, Dct, Mc, and Rlsq specialize Coproc; each component owns 0..* Interfaces, each bound to one Protocol]
Simulator: sequential execution
• Very fast functional verification
• One single thread of control
• Communication through function calls
• Statistics, e.g. number of reads, cache misses, …
• Compiles and runs without TSS
Simulator: sequential execution implementation
[Class diagram: SequentialSimulator::Simulate() drives 1..* Component::Run()]

  Simulate()
  {
    for ( execution=0; execution<100; execution++ )
    {
      Component->Run();
    }
  }
Simulator: timed execution
• Performance metrics
• Full communication protocols
• Sequential C-code via multi-threading
• Run time definition of threads
• Compiles and runs without TSS
Simulator: timed execution implementation
[Class diagram: ThreadingSimulator::Simulate() drives 1..* ComponentThread::Thread(), each bound to one Component::Run()]

  Simulate()
  {
    for ( cycle=0; cycle<1000; cycle++ )
    {
      ComponentThread->JumpThread();
    }
  }

  Thread()
  {
    while( 1 )
    {
      Component->Run();
    }
  }
Timed execution: Execute()
  void Dct::Thread()
  {
    while( 1 )
    {
      Dct();
      Execute(64);
    }
  }

  void Execute(int delay)
  {
    while( delay > 0 )
    {
      delay--;
      JumpMain();
    }
  }

  void MainScheduler()
  {
    for (int cycle=0; cycle < 10000; cycle++)
    {
      Dct->JumpThread();
      Vld->JumpThread();
      Mc->JumpThread();
    }
  }
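The thread-jumping mechanism above can also be approximated without stack switching. This hypothetical sketch (not the simulator code) models each component as a step function plus a delay counter, stepped once per simulated cycle — Execute(n) becomes "return n as the delay until the next step":

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A component runs its step whenever its delay counter reaches zero;
// the step returns the number of cycles until it runs again (must be >= 1).
struct Component {
  std::function<int()> step;
  int remaining = 0;  // cycles left until the next step
};

// Round-robin over all components, one pass per simulated cycle.
// Returns a trace of which component ran, in order.
std::vector<std::string> RunCycles(
    std::vector<std::pair<std::string, Component>>& comps, int cycles) {
  std::vector<std::string> trace;
  for (int cycle = 0; cycle < cycles; cycle++) {
    for (auto& entry : comps) {
      Component& c = entry.second;
      if (c.remaining == 0) {
        trace.push_back(entry.first);
        c.remaining = c.step();
      }
      c.remaining--;  // one cycle elapses for this component
    }
  }
  return trace;
}
```

This trades the convenience of writing sequential C code (which the real multi-threaded approach preserves) for a simpler, single-stack scheduler.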
Timed execution: Read()

  void Dct::Thread()
  {
    while( 1 )
    {
      ...
      DtrInterface->Read(0,0,8,data);
      ...
    }
  }

  void DtrInterface::Read(int port, int offset, int size, DataT &data)
  {
    PortOut.Set( port );
    OffsetOut.Set( offset );
    SizeOut.Set( size );
    RequestOut.Set( !RequestOut );
    while ( AckIn.Get() != RequestOut )
      JumpMain();
    data = DataIn.Get();
  }

  void DtrInterface::Poll()
  {
    if ( RequestIn.Get() != AckOut )
    {
      int port = PortIn.Get();
      int offset = OffsetIn.Get();
      int size = SizeIn.Get();
      DataT data[size];
      Dtr->Read( port, offset, size, data );
      DataOut.Set( data );
      AckOut.Set( RequestIn.Get() );
    }
  }

  void Dtr::Read(int port, int offset, int size, DataT &data)
  {
    ... // Get data from cache
    data = ...
  }
Simulator: TSS execution
• Dynamic binding of TSS code to the simulator
• Run time definition of TSS module boundaries
• Thread model inside TSS module
• TSS port creation
• Automatic Netlist generation
Simulator: TSS execution implementation
[Class diagram: TssSimulator::Simulate() drives 1..* TssModule::Clock(); each TssModule jumps its ComponentThread::Thread(), which calls Component::Run()]

  Clock()
  {
    ComponentThread->JumpThread();
  }

  Thread()
  {
    while( 1 )
    {
      Component->Run();
    }
  }
TSS: module boundaries

[Figure: Vld and Mc coprocessors, each with a shell (Dtr, Dtw, Ss, Ts, Sync), connected to the transport network; each block can be a separate TSS module:]

  Vld.ModuleName : Vld
  Vld.Shell.ModuleName : VldShell
  Mc.ModuleName : Mc
  Mc.Shell{
    Dtr.ModuleName : McShellDtr
    Dtw.ModuleName : McShellDtw
    Ss.ModuleName : McShellSs
    Ts.ModuleName : McShellTs
    Sync.ModuleName : McShellSync
  }
  Transport.ModuleName : Transport
TSS: module boundaries

[Figure: alternatively, the entire system — both coprocessors with their shells and the transport network — forms a single TSS module:]

  ModuleName : Eclipse
TSS: co-simulation TSS-Verilog

[Figure: the Vld/Mc system as before, but with one shell block (Ts) replaced by a Verilog implementation while the rest remains TSS]
Simulator retargetability
[Toolchain figure as on the earlier 'Simulator toolchain' slide: application setup and architecture setup scripts drive the simulator, producing performance metrics, wave forms, and debug traces]
Simulator retargetability: Eclipse instantiation
[Figure: the application setup script (Create Vld, Create Dct, Create Mc) and the architecture setup file instantiate Vld, Dct, and Mc coprocessors, each with its shell, on the transport network]
Coprocessor instantiation

[Class diagram: a CoprocFactoryRegistry (Register(), GetCoprocFactory(Name)) holds 1..* CoprocFactory (CreateCoproc()); VldFactory, DctFactory, McFactory, and RlsqFactory register themselves by name and create the corresponding Coproc subclasses — Vld, Dct, Mc, Rlsq — each implementing Init(), Run(), GetTask(), Read(), Write(), GetSpace(), PutSpace(), Execute(); the 'Create Vld', 'Create Mc', ... commands in the application setup select a factory by name and attach the new coprocessor to a shell]
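The factory registry can be sketched as follows (illustrative C++, not the simulator source; names follow the diagram):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Base class for all coprocessor models (reduced to a Name() query here).
struct Coproc {
  virtual ~Coproc() = default;
  virtual std::string Name() const = 0;
};

// Registry mapping a coprocessor name to a factory function, so the
// application setup script can instantiate coprocessors by name.
class CoprocFactoryRegistry {
  std::map<std::string, std::function<std::unique_ptr<Coproc>()>> factories;
public:
  void Register(const std::string& name,
                std::function<std::unique_ptr<Coproc>()> factory) {
    factories[name] = std::move(factory);
  }
  std::unique_ptr<Coproc> Create(const std::string& name) {
    return factories.at(name)();  // throws if the name is unknown
  }
};

// Two example coprocessor models.
struct Vld : Coproc { std::string Name() const override { return "Vld"; } };
struct Dct : Coproc { std::string Name() const override { return "Dct"; } };
```

This keeps the simulator core independent of the set of coprocessors: adding a new coprocessor means registering one more factory, with no change to the setup-script parser.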
Architecture setup

[Component class diagram as on the 'Simulator components' slide — the architecture setup file configures this component hierarchy]
Retargetability: application configuration
  Dct.Shell{
    Ss.StreamTable{
      TASK_ID: 1
      BUF_SPACE : 0x100
    }
  }
Application setup

[Component class diagram as on the 'Simulator components' slide — the application setup script configures tasks and streams in this component hierarchy]
Simulator output
[Toolchain figure as on the earlier 'Simulator toolchain' slide — the simulator produces performance metrics, wave forms, and debug traces]
Simulation output: wave forms
Simulator output: performance data collection
• Collection of critical performance indicators
• Subset of performance indicators implemented in HW in stream and task tables
• Used for:
  • Architecture evaluation at silicon design time
  • Application tuning at application design time
  • QoS resource management at run-time
Viewing performance data
Viewing performance data: processor dynamics
Viewing performance data: processor metrics
Viewing performance data: buffer filling
Outline
• DVP
• Eclipse DVP subsystem
• Eclipse architecture
• Eclipse application programming
• Simulator
• Status
Status
[Chart: abstraction (high → low) against cost (low → high), with alternative realizations explored at each step:]
• Initial architecture study (1997)
• Feasibility study (October 1998)
• Generic architecture definition (August 1999)
• Specific architecture definition (February 2000)
• Specific architecture implementation (July 2000)
Current status
• Eclipse documentation
  • Concepts
  • Design path
  • Implementation
• Applications:
  • Coprocessor functional models for MPEG2 HD/SD decoding (Vld, Mc, Idct, Rlsq) supporting downscaling
  • MPEG2 encoder in generic YAPI
  • MPEG4, 3D Gfx scheduled for 2001
  • Natural Motion anticipated
Simulator status
• Simulator framework:
  • Retargetable and flexible through design patterns
  • Re-use of methodology, design patterns, implementation (Sandra, QoS, TSSA-2)
• Simulator hardware model:
  • Functional, bit-level accurate model of shells
  • Abstract model of transport network and coprocessors
• Simulator toolchain:
  • Approx. 25,000 lines of C++ code, 250 files (CVS version management, multi-platform makefile structure, automatic source documentation)
  • Integration testing phase
  • Submitted to CRE 2001
Conclusion
• Eclipse fits neatly in DVP system level architecture
• Flexibility through:
  • Application (re-)configuration
  • Medium-grain HW / SW interaction
  • Co-processor multi-tasking (without runtime CPU control)
• Cost-effectiveness through:
  • HW / SW balancing
  • Time-shared co-processor use
• Tools for application configuration, simulation, and performance analysis are up and running
Acknowledgements
Persons from several groups in PRLE:
• IPA (Lippmann): Evert-Jan Pol, Jos van Eijndhoven, Martijn Rutten, Anup Gangwar
• ESAS (van Utteren): Pieter van der Wolf, Om Prakash Gangwal, Gerben Essink
• IT (Dijkstra): Koen Meinds
• Video processing & Visual Perception (Depovere): Gerben Hekstra, Egbert Jaspers, Erik van der Tol, Martijn van Balen
• Digital Design & Test (Niessen): Manish Garg