A Holistic Approach for
building MPSoCs.
Jordi Carrabina
CAIAC. UAB,
Barcelona (Catalonia, Spain) [email protected]
Thanks to David Castells, Eduard Fernandez, Albert Saa
UAB location
Barcelona
UAB. Campus Bellaterra
Overview
• Short group bio
• Fundamental Concepts
• Platform-based design
• MPSoC & NoCs
• Tools for design and verification
• Application example
Ambient Intelligence & Accessibility
Research lines
• Content design
• VR, modeling & videogames
• MM platforms & Interactivity
• Speech technologies
• Computation & integration technologies
• Electronics are being integrated in complex environments
• Solutions have to be physically and functionally flexible
Goal: flexible systems
ES Industrial Developments
• Mitsubishi Electric easyPhoto
• Praesentis submarine video logger
• On-Laser laser marking controller
Projects
• PARMA (ITEA2 2007-2010)
• SMECY (Artemis)
• H4H (ITEA2 2010-2013)
• MANY (ITEA2 2011-2014)
• COBRA (CATRENE 2010-2013)
• HARP (CATRENE 2013-2015)
• Agnitio (AVANZA 2009-2012)
• Documeet (FP7 2012-2014)
Fundamental Concepts
Basic Concepts: More-Moore (MPSoC)
& More-than-Moore (PE)
Ambient Intelligence: cheap, interoperable,
low power embedded software platforms
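[Figure: device classes for ambient intelligence. Classes: explicit computing; nomadic & private spaces; sensors and actuators (ambient). Power budgets: 100 W, 1 W, 100 mW, 100 µW, grouped as "Watt" (mains), "Milliwatt" (battery and cheap consumer) and "Microwatt" (ambient). Performance: 1 Tflops, 100 Gops, 10 Gops, 10 Mops. Software: GP SW, E SW, E SW. Courtesy: H. De Man, ESSCIRC'06]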
Gates & Wires
Platform-based design
Platforms in Electronic Systems
A platform is a family of architectures satisfying a set of
constraints imposed to allow the reuse of hardware
and software components. However, a hardware
platform is not enough. Quick, reliable, derivative
design requires using a platform application
programming interface (API) to extend the platform
toward application software. In general, a platform is
an abstraction layer that covers many possible
refinements to a lower level. Platform-based design is a
meet-in-the-middle approach: In the top-down design
flow, designers map an instance of the upper platform
to an instance of the lower, and propagate design
constraints [Sangiovanni-Vincentelli, 2002].
Design Evolution (I) (Gajski)
Design Evolution (II) (General)
• Model ("golden"): untimed, timed, cycle accurate
• Architecture: HW, SW, OS, FW
• Implementation: I/O, memory, power supply, clock & reset
Design Evolution (III) (F. Catthoor)
[Figure: algorithms + data structures mapped onto an architecture (ARM, IP1, IP2, RAM, ROM) and onto a platform architecture (microprocessor, DSP, MMU, custom logic, RAM, ROM). © imec 2002]
Platform-based design: API http://embedded.eecs.berkeley.edu/metropolis/platform.html
Platform Design Methodologies: Platform Stacks
Application
Architecture
System Platform Stack
Silicon Implementation
Silicon Implementation Platform Stack
Architecture Platform Instance
Silicon Implementation Platform Instance
Application
Implementation
Path to
industrialization
MPSoC & NoCs
SoC Complexity
• Clock Domain and GALS Model
  – The “spatial” domain (tile) of the clock signal is being reduced with the increase of integration density and operating frequency
• Globally Asynchronous
  – Different synchronous regions communicate through asynchronous protocols or a clock hierarchy
• Locally Synchronous
  – In each synchronous region there is only one valid clock that relies on the “classical” set of correct design rules
<2000
>2000
Homo- & Heterogeneous IP Arrays
Power Management
• Reduce power consumption by adapting supply voltage and clock frequency to computational needs (at task level)
  – Dynamic Frequency and Voltage Scaling
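Task-level frequency and voltage scaling can be illustrated with a minimal sketch. It assumes a hypothetical table of operating points and the usual first-order model in which dynamic power is proportional to V²·f. The names `OperatingPoint`, `kPoints`, `relative_power` and `pick_operating_point`, and all numeric values, are illustrative and do not come from the slides.

```c
#include <stddef.h>

/* Hypothetical operating point: clock frequency and supply voltage. */
typedef struct {
    double freq_mhz;
    double volt;
} OperatingPoint;

/* Illustrative table, sorted from slowest to fastest. */
static const OperatingPoint kPoints[] = {
    { 50.0, 0.9 }, { 100.0, 1.0 }, { 200.0, 1.2 },
};
#define N_POINTS (sizeof kPoints / sizeof kPoints[0])

/* First-order model: dynamic power is proportional to V^2 * f. */
double relative_power(const OperatingPoint *p)
{
    return p->volt * p->volt * p->freq_mhz;
}

/* Return the slowest point that finishes `cycles` within `deadline_us`,
 * or NULL if even the fastest point misses the deadline. */
const OperatingPoint *pick_operating_point(double cycles, double deadline_us)
{
    for (size_t i = 0; i < N_POINTS; i++) {
        /* freq in MHz equals cycles per microsecond */
        double time_us = cycles / kPoints[i].freq_mhz;
        if (time_us <= deadline_us)
            return &kPoints[i];
    }
    return NULL;
}
```

Picking the slowest point that still meets the deadline is only one possible policy; it is chosen here because it makes the power saving visible with very little code.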
SoC Paradigm (Single-Tile)
• Single processor model: “classical” processor and/or HW blocks with a flexible bus
[Figure: CPU with program memory, data memories behind arbiters, HW accelerators and peripherals]
MPSoC Speed Performance
• Execution time for a given application:
$$ t_{function} = \frac{N_{proc}}{funct} \cdot \frac{inst}{proc} \cdot \frac{cyc}{inst} \cdot \frac{ns}{cyc} $$
• Parallelism
• Networks
• Compilers
• Instruction Set
• Coprocessors
• Micro-Architecture
• Inst. Parallelism
• Technology
• Device
PLATFORM
The beginnings…
From scalability to NoCs
• Interconnect is becoming a major design bottleneck
  – Point-to-point
  – On-chip busses
    • Shared or hierarchical
  – Crossbars
  – Network-on-chip
• Design Space Exploration through EDA/CAD tools
• IP core reusability and efficient HW-SW interfaces
• Embedded software design
  – Parallel programming models
  – Runtime middleware routines
[Figure: embedded software, HW/SW interfaces and execution platform; HW/SW trade-offs between embedded system software and hardware]
Process Communication (NoC)
[Figure: block A sends to blocks B and C through logical ports with snd(msg1, 0) and snd(msg2, 1); block B receives with r1=rcv(0) and r2=rcv(1). Each block reaches the network through a wrapper with write and read interfaces. A sits at tile 0,0, B at 2,2 and C at 2,0. The routing table translates the logical destination [ B, 1, msg1 ] into the header [ 2, 2, 1, msg1 ], which is updated hop by hop: [ 2, 2, 1, msg1 ] → [ 1, 2, 1, msg1 ] → [ 0, 2, 1, msg1 ] → [ 0, 1, 1, msg1 ] → [ 0, 0, 1, msg1 ] → [ 1, msg1 ]]
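The header sequence in the figure ([ 2, 2, 1, msg1 ] down to [ 0, 0, 1, msg1 ]) suggests that each switch consumes one remaining hop, X first and then Y. The sketch below illustrates that reading. It assumes non-negative offsets only, and the names `NocHeader`, `RouteEntry`, `make_header`, `route_step` and `hops_to_destination` are hypothetical, not part of NoCMaker.

```c
#include <stdbool.h>

/* Hypothetical header: remaining hops in X and Y plus the logical port. */
typedef struct {
    int dx;
    int dy;
    int port;
} NocHeader;

/* Hypothetical routing-table entry: tile offset of a logical destination. */
typedef struct {
    char name;
    int dx;
    int dy;
} RouteEntry;

/* Routing table of block A at tile (0,0), using the tiles from the figure. */
static const RouteEntry kRoutes[] = { { 'B', 2, 2 }, { 'C', 2, 0 } };

/* Translate a logical destination into a header; false if it is unknown. */
bool make_header(char dest, int port, NocHeader *out)
{
    for (unsigned i = 0; i < sizeof kRoutes / sizeof kRoutes[0]; i++) {
        if (kRoutes[i].name == dest) {
            out->dx = kRoutes[i].dx;
            out->dy = kRoutes[i].dy;
            out->port = port;
            return true;
        }
    }
    return false;
}

/* One switch traversal: consume X hops first, then Y hops.
 * Returns true when the packet has arrived and is delivered locally. */
bool route_step(NocHeader *h)
{
    if (h->dx > 0) { h->dx--; return false; }
    if (h->dy > 0) { h->dy--; return false; }
    return true;
}

/* Count the switch-to-switch hops needed to reach the destination tile. */
int hops_to_destination(NocHeader h)
{
    int hops = 0;
    while (!route_step(&h))
        hops++;
    return hops;
}
```

Carrying the remaining offset in the header keeps each switch stateless: it only needs to test two small counters to choose an output port.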
Preliminary NoC Concepts
• Network-on-chip (NoC) View
  – Links
  – Switch
  – Network Interface (NI)
  – System components (IP cores)
• NoC design space is huge
  – Topology
  – Routing algorithm
  – Switching techniques
  – Buffering/Virtual Channels
    • Location/Depth/Flit width
  – Channel arbitration
  – Flow control
[Figure: NoC of switches connecting IP cores (DSP, MPEG, DRAM, accelerator, CPU, DMA, I/O, coprocessor), each attached through a network interface (NI)]
[ogras05]
[castells06]
[Figure: FPGA system interconnect with two bus masters (processor, 32-bit; Ethernet, 32-bit), an arbiter and an address decoder on the address and data paths, clock-domain crossings (Clock 1, Clock 2) and width-matching logic in front of the slaves: interrupt controller, timer (16-bit), UART (8-bit), DDR2 (64-bit), PCI (64-bit), memory (32-bit)]
FPGA & GALS
© 2007 Altera Corporation
Tools for MPSoC design
and verification:
Holistic approach
Current FPGAs can embed many RISC processors, creating a many-soft-core system (>100 processors)
Some embedded systems are
• Application specific
• Produced in low volumes
• Required to meet constraints (energy, performance)
This can economically justify the use of many-soft-cores tailored to specific applications (which can include specific instructions & co-processors)
Many Soft-Core Systems
EDA Tools for MPSoCs
• NoCMaker EDA tool (http://sourceforge.net/projects/nocmaker/)
  – Efficient RTL code generation of a NoC for fast prototyping
  – Easy capture and quick tuning and optimization of NoCs through a GUI
  – Easy simulation, verification and validation of HW-SW components on a NoC-based system
  – Automatic generation of synthetic traffic
    • Identification of bottlenecks/congestion (performance, workload, balancing)
  – Early pre-synthesis estimations of area, performance and power
  – Visual interactive simulator window
EDA Tools: NocMaker
• Cross-platform open-source EDA tool for design space exploration of NoC-based systems
• NoCMaker is based on JHDL [bellows1998]
Modeling NoC using NocMaker
Simulate Traffic Patterns & Applications
• Traffic patterns
  – Custom, finite or infinite
    • Nodes, IR and message length
  – Pre-defined traffic patterns
    • Universal, bit reversal, perfect shuffle, butterfly, matrix transpose, complement
    • Token pass, barrier, …
• Parallel applications (ocMPI)
  – MPI-based Java stack to run message passing apps on NoCMaker
HW-SW Simulation and Validation
• Simulation of NoC-based MPSoCs
• Validation and verification
  – Circuit browser and waveform viewer
  – Packet sequence analyzer
• Detect application deadlocks/livelocks when message-passing parallel applications run on top of the NoC-based MPSoC
Off-loading MPI
• Eager protocol
  – Fast, but requires a buffer on the receiver
• Rendez-vous protocol
  – Requires synchronization, adding overhead
Off-loading MPI
• The overhead of synchronization in Rendez-vous is caused by additional instruction execution and the processing of short, simple synchronization messages
• Solution: implement a (small) independent NoC for synchronization and an NI that contains synchronization primitives
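The trade-off between the two protocols can be made concrete with a small host-side sketch. It is only a model under stated assumptions: the receiver buffer size, the `Receiver` structure and the function names `eager_send` and `rendezvous_send` are hypothetical, and the handshake is reduced to a request-to-send and a clear-to-send message.

```c
#include <stddef.h>
#include <string.h>

#define EAGER_BUF_WORDS 8   /* hypothetical receiver-side buffer size */

typedef struct {
    int    buffer[EAGER_BUF_WORDS]; /* only needed by the eager protocol */
    size_t buffered;                /* words currently parked in the buffer */
    int    recv_posted;             /* receiver has already posted its receive */
    int    control_msgs;            /* synchronization messages exchanged */
} Receiver;

/* Eager: push the data immediately; it is parked in the receiver buffer.
 * Returns 0 on success, -1 if the buffer would overflow. */
int eager_send(Receiver *r, const int *data, size_t n)
{
    if (r->buffered + n > EAGER_BUF_WORDS)
        return -1;
    memcpy(&r->buffer[r->buffered], data, n * sizeof data[0]);
    r->buffered += n;
    return 0;
}

/* Rendez-vous: handshake first, then write straight into the user buffer.
 * Returns 0 on success, -1 if the receiver is not ready yet. */
int rendezvous_send(Receiver *r, const int *data, size_t n, int *dst)
{
    r->control_msgs++;              /* request-to-send */
    if (!r->recv_posted)
        return -1;                  /* sender has to retry later */
    r->control_msgs++;              /* clear-to-send */
    memcpy(dst, data, n * sizeof data[0]);
    return 0;
}
```

The `control_msgs` counter stands for exactly the short synchronization messages that the slides propose to move onto a separate, small NoC.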
Applications: Sharing FPUs
• FPUs have independent, pipelined units for each FP operation, but a single processor uses the pipelines inefficiently (no multiple operations in flight)
• Solution: fill the pipelines with FP operations from several tiles
[Figure: clock cycles overhead (0% to 6%) vs. number of processors (0 to 16) for Mandelbrot and matrix multiplication]
[Figure: time overhead (-1% to 6%) vs. number of processors (2 to 16) for Mandelbrot and matrix multiplication]
[Figure: resource usage in KLUTs+KFFs (0 to 140) with private FPUs vs. a shared FPU, broken down into FPU, NoC, NICs, CPUs and others]
IS extensions to reduce latency
[Figure: tokenpass and reqmas benchmarks, in NocMaker and on the real system, comparing an Avalon NA with a CI NA; left axis 0 to 700, right axis speedup 0 to 3.5; speedup labels 1.21, 1.18, 2.91, 2.43]
• Embed communication primitives at the ISA level
• Implemented as NIOS custom instructions
Performance Analysis Support
• Source-to-source compiler to instrument code with custom instructions (CIs)
• CIs inject traces into a dedicated network
• Tracing unit dumps traces to external memory
• Benefits: scalable solution with minimal overhead
Build full-system virtual platforms consisting of an ISS plus HDL models (NoC and NIC) to introduce time accuracy for computation while reducing verification time
Use transparent instrumentation
• Avoiding instrumentation overhead
• Avoiding the memory / bandwidth otherwise required
[Figure: transparent trace logging in virtual time. The calling function executes `__asm call func`. Before the original body of `double func(int a, int b)` runs, a trace logging function finds the function name from the function address, calls Log(Enter) and pushes the function name onto a stack. When the called function executes `__asm ret`, a trace logging function pops the function name from the stack and calls Log(Leave). Trace memory then holds: time stamp vt0, enter func; time stamp vt1, leave func]
Performance Analysis using Virtual Platforms
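The enter/leave logging scheme can be sketched in plain C. This is a simplified host-side model: the function name is passed in directly instead of being looked up from the function address, virtual time is a plain counter, and every identifier except `func` (`TraceRecord`, `trace_enter`, `trace_leave`, the buffer sizes) is hypothetical.

```c
#include <stddef.h>

#define TRACE_CAPACITY 64   /* hypothetical trace memory size */
#define STACK_DEPTH    16   /* hypothetical maximum tracked call depth */

typedef enum { TRACE_ENTER, TRACE_LEAVE } TraceKind;

typedef struct {
    unsigned long vt;       /* virtual time stamp */
    TraceKind     kind;
    const char   *func;
} TraceRecord;

static TraceRecord   trace_mem[TRACE_CAPACITY];
static size_t        trace_len;
static const char   *name_stack[STACK_DEPTH];
static size_t        depth;
static unsigned long virtual_time;  /* advanced by the platform in a real system */

static void trace_log(TraceKind kind, const char *func)
{
    if (trace_len < TRACE_CAPACITY)
        trace_mem[trace_len++] = (TraceRecord){ virtual_time, kind, func };
}

/* On entry: log the event, then remember the function name. */
void trace_enter(const char *func)
{
    trace_log(TRACE_ENTER, func);
    if (depth < STACK_DEPTH)
        name_stack[depth] = func;
    depth++;
}

/* On return: recover the name from the stack, then log the event. */
void trace_leave(void)
{
    if (depth == 0)
        return;
    depth--;
    trace_log(TRACE_LEAVE, depth < STACK_DEPTH ? name_stack[depth] : "?");
}

/* Example traced function; the division stands in for the original body. */
double func(int a, int b)
{
    trace_enter("func");
    virtual_time += 10;             /* pretend the body takes 10 time units */
    double result = (double)a / b;
    trace_leave();
    return result;
}
```

In a virtual platform the two hooks would run inside the simulator rather than in the guest code, which is what keeps the instrumentation transparent to the measured application.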
QEmu/SystemC
• Mix the SystemC HDL with QEmu for ESL design
  – Various possible levels (TLM, RTL)
  – Either SystemC as a device or QEmu as a SystemC module
Application example
using many soft-cores
Application
• Multi-Soft-Core based Laser Marking Controller
• On-Laser is a Spanish SME that creates application-specific laser-based systems
Introduction
• Off-the-shelf components
  – Expensive & less flexible
    • Adaptation to multiple machines
    • High productivity as selling argument
• Hard-RT requirements (laser burns)
• Goal: create a new controller with high productivity & flexibility, meeting RT constraints, reduced cost, simultaneous control of 4 heads (high laser usage)
  – Incremental approach for fast development and systematic testing, avoiding custom HW development
Laser Marking Process
• Components for laser marking in 1D
  – Laser source
  – Galvanometer mirror
  – Lens
Laser Marking Process
• Going to 2D
  – Additional galvanometer mirror
[Figure: laser light source, two galvanometers, lens]
Laser Marking Process
• Rotating system (e.g. for cork stoppers)
  – Additional servo / encoder pair
[Figure: laser light source, two galvanometers, lens, motor / encoder]
Laser Marking Process
• 4 heads with 4 laser light sources
[Figure: four heads, each with a laser light source, two galvanometers, a lens and a motor / encoder]
Laser Marking Process
• Multiplexing a single laser light source (patented by On-Laser)
Initial HW / SW decisions
• OTS low-cost FPGA board desired (Altera)
  – One NIOS-II controls each marking head
  – 4 “independent” soft-cores will be used
  – Synchronization required when multiplexing the laser light source
• Shared memory architecture: easy sharing of program and data
• Laser pulses are on the order of a few µs and have hard real-time requirements => HW mandatory
• Scan head galvanometers are controlled through proprietary protocols (XL2-100, SL2) that continuously send their coordinates through a serial connection => HW mandatory
• Motor pulses are relatively slow (few ms or below) => SW ?
• Encoder readings must be quickly & accurately integrated to correct galvanometer position => HW ?
Laser Pulse Control
• Pulse requirements
  – A pulse is generated for every image pixel
  – The pulse duty cycle determines the greyscale level
  – Pixels are grouped in rasters that must be burned sequentially, avoiding time gaps
• HW design
  – Custom instruction that receives “active time” and “duration” operands
  – Non-blocking operation to avoid temporal gaps
  – Raster control in SW
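The split between the custom instruction and the software raster loop can be sketched as follows. This is a host-runnable model under assumptions: `laser_pulse_ci` is a hypothetical stand-in for the real custom instruction and only records its operands, pixels are assumed to be 8-bit grey levels, and the linear mapping from grey level to active time is an illustration rather than the controller's actual calibration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the non-blocking custom instruction.
 * It only records its operands so the sketch can run on a host. */
static uint32_t last_active, last_duration, pulses_issued;

static void laser_pulse_ci(uint32_t active_time, uint32_t duration)
{
    last_active = active_time;
    last_duration = duration;
    pulses_issued++;
}

/* The duty cycle encodes the grey level (assumed linear):
 * 0 keeps the laser off, 255 keeps it on for the whole pixel time. */
uint32_t active_time_for(uint8_t grey, uint32_t duration)
{
    return (uint32_t)(((uint64_t)grey * duration) / 255u);
}

/* Raster control in software: one pulse per pixel, issued back to back. */
void mark_raster(const uint8_t *pixels, size_t n, uint32_t pixel_duration)
{
    for (size_t i = 0; i < n; i++)
        laser_pulse_ci(active_time_for(pixels[i], pixel_duration), pixel_duration);
}
```

Because the real instruction is non-blocking, the loop can compute the operands of the next pixel while the current pulse is still being generated in hardware, which is how time gaps inside a raster are avoided.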
Scan Head Control
• Positioning requirements
  – Use of industrial protocols (XL2-100, SL2)
  – Continuously updating the position
• HW design
  – Custom instruction that receives X and Y positions as operands
  – Non-blocking operation
Validation of 2D marking system
• Early validation of 2D marking
• First productivity measurements
• First quality assessments
Extending to rotary systems
• Introduce a virtual coordinate system
• Integrate rotating movement
Phy X = X Origin Offset + Virtual X – Encoder Advance
Phy Y = Y Origin Offset + Virtual Y
[Figure: physical coordinates (Phy X, Phy Y), virtual coordinates (Virtual X, Virtual Y) and encoder advance]
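The two coordinate equations translate directly into code. The sketch below is a minimal illustration; the type and function names (`Point`, `HeadConfig`, `virtual_to_physical`) and the use of plain integers for positions are assumptions.

```c
/* Position in scan-head units (the unit itself is an assumption). */
typedef struct {
    int x;
    int y;
} Point;

/* Per-head origin offsets of the virtual coordinate system. */
typedef struct {
    int x_origin_offset;
    int y_origin_offset;
} HeadConfig;

/* Phy X = X Origin Offset + Virtual X - Encoder Advance
 * Phy Y = Y Origin Offset + Virtual Y */
Point virtual_to_physical(const HeadConfig *cfg, Point virt, int encoder_advance)
{
    Point phy;
    phy.x = cfg->x_origin_offset + virt.x - encoder_advance;
    phy.y = cfg->y_origin_offset + virt.y;
    return phy;
}
```

Only the X axis is corrected because the rotation moves the surface along one direction; as the encoder advance grows, the same virtual point maps to a smaller physical X.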
Motor Control
• Requirements
  – Motor pulse generation is slow
  – But encoder feedback should be integrated immediately to correct the scan head position
• HW design
  – PIO for motor control
  – Avalon bus slave device for encoder reading. It can be programmed to increment / decrement by an arbitrary number of units
Tiled Design
• Independent tiles are almost identical
• They share a common bus to access the SDRAM controller and on-chip memory
Design flow
• Quartus II
Design flow
• SOPC Builder
  – Reset control not easy -> Verilog manipulation
• Currently migrating to Qsys
Boot loading process
• Problem
  – Altera tools allow booting from EPCS (flash) or from RAM (downloaded from the host when debugging)
  – Only the processor that has the EPCS controller can boot from EPCS
  – Offsets for (CPU1, CPU2, CPU3) in EPCS are unknown
• Solution
  – Boot CPU0 with the standard EPCS bootloader
  – Develop a custom bootloader, executed on CPU0, that
    • Resets the slave CPUs
    • Transfers data from EPCS to RAM
    • Boots the slaves from RAM
  – The slave code is only a bootloader that receives the main function pointer from CPU0
Host Communication
• Host SW
  – Download of images to mark
  – Control activation of the heads
• Communication through JTAG UART with CPU0
  – CPU0 has to forward messages to other CPUs if necessary
Programming Model
• Message passing to coordinate operation
  – Matrix of software mailboxes (single message)
  – Implemented in shared memory
  – IO operations to bypass cache
• Shared memory allows program and data to be reused
  – Slave bootloaders transfer execution to CPU0 main code
#include <io.h>   /* Nios II HAL: IORD / IOWR cache-bypassing accesses */

/* UnidirectionalPipe and busyLoop() are defined elsewhere in the project. */

/* Blocking send: wait until the single-slot pipe is empty, then publish v. */
void sendPipe(UnidirectionalPipe* pipe, int v)
{
    int i = 5;
    // ensure no data in the pipe, backing off a little longer on each retry
    while (IORD(&(pipe->available), 0)) {
        busyLoop(i++);
    }
    IOWR(&(pipe->data), 0, v);        // write the payload first
    IOWR(&(pipe->available), 0, 1);   // then raise the flag
}

/* Blocking receive: wait until a message is available, then consume it. */
int recvPipe(UnidirectionalPipe* pipe)
{
    int data, i = 5;
    while (!IORD(&(pipe->available), 0)) {
        busyLoop(i++);
    }
    // take the data and free the flag
    data = IORD(&(pipe->data), 0);
    IOWR(&(pipe->available), 0, 0);
    return data;
}
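The "matrix of software mailboxes (single message)" can be sketched on a host as one single-slot mailbox per (sender, receiver) pair. This is an illustration under assumptions: the names `Mailbox`, `mbox_try_send` and `mbox_try_recv` are hypothetical, the calls are non-blocking variants, and `volatile` accesses stand in for the cache-bypassing IO operations used on the real target.

```c
#include <stdbool.h>

#define N_CPUS 4   /* one soft-core per marking head, as in the slides */

/* One single-message mailbox per (sender, receiver) pair. */
typedef struct {
    volatile int available;   /* 1 while a message is waiting */
    volatile int data;
} Mailbox;

static Mailbox mailboxes[N_CPUS][N_CPUS];

/* Non-blocking send: fails if the previous message was not consumed yet. */
bool mbox_try_send(int from, int to, int value)
{
    Mailbox *m = &mailboxes[from][to];
    if (m->available)
        return false;
    m->data = value;        /* payload first */
    m->available = 1;       /* then the flag */
    return true;
}

/* Non-blocking receive: fails if no message is waiting. */
bool mbox_try_recv(int from, int to, int *value)
{
    Mailbox *m = &mailboxes[from][to];
    if (!m->available)
        return false;
    *value = m->data;
    m->available = 0;
    return true;
}
```

Giving every pair its own slot means each mailbox has exactly one writer of the payload and one consumer, so a single flag is enough and no lock is needed.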
Performance analysis utilities
• HW timing tested by
  – Simulation
  – SignalTap
  – External analyzer
• SW timing tested by trace generation
  – Automatic compiler instrumentation
  – Generation of OTF traces to be visualized in HPC tools
Synthesis Results
• Synthesis for Cyclone IV 22KLE
• Terasic DE0_Nano
Resource              Used / Available    Utilization
Logic Elements (LEs)  17058 / 22320       76 %
Memory bits           174504 / 608256     29 %
fmax                  62.98 MHz
Verification & Optimization
• Fundamental part of the design process (2/3)
• Unit testing
• Time analysis
• Real system-level productivity
• Quality observed in real execution
Results
Conclusions
• Technological evolution pushes for many-core solutions in the embedded application-specific domain
• The multi-soft-core system proved highly flexible, allowing:
  – Fast coding, early assessment of whether requirements are met, incremental development, minimal HW design (just where needed)
  – Reuse of tools from HPC (MPI on many-soft-cores)
• From an industrial perspective
  – Available OTS components
  – Cost reduction (~2 orders of magnitude)
  – High efficiency (up to 97% of laser light usage)
Thanks for your
attention
HIP3ES 2014. High Performance Energy Efficient Embedded Systems
– http://www.eurekamany.org/hip3es2014.html
– Vienna, Jan 21st 2014. Co-located with HiPEAC 2014
– http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=33452&copyownerid=8130
QoS
Experiments
Motivations & Key Challenges
• Many-soft-cores share a common resource
• Problem: collisions & interference
• Two typical solutions
  – Overdimensioning
  – QoS
• Test how to make QoS visible to the developer in an MP programming model on a shared memory architecture
QoS Support at NoC Level
• Traditional QoS techniques also apply in NoCs, but…
  ☹ Tight area and power constraints
  ☺ Control on the software stack
  ☺ Reconfiguration of taped-out NoC-based chips using software facilities
• Related work
  – QNoC [bolotin04]
  – Combining BE and GT [andreasson04][marescaux07]
  – Design to support multiple use cases [murali06][hansson07]
  – QoS management at task level [hansson09][carara09]
• Handle QoS communication services on a message-passing parallel programming model
Architectural Support for QoS Services
• Proposed QoS support extending the best-effort xpipes NoC
  – Soft-QoS: configurable N levels of priority (up to 8 levels)
  – Hard-QoS: Guaranteed Throughput (GT) using emulation of circuit switching
[Figure: packet header fields: OTHER PACKET FIELDS, SENDER, QOS, ROUTE]
• CPU triggers QoS at the Network Interface (NI) level
  1. CPU writes into memory-mapped registers in the NI
  2. CPU performs the actual transaction(s)
  3. NI injects packets with suitable tag bits into the NoC
Proposed Vertical Approach
• Runtime QoS support through HW and middleware routines
  – HW QoS features exposed through software middleware routines
  – Vertical approach: QoS on top of the parallel programming model (MPI)
[Figure: layered view of NoC communication. Software: application and system layers; architecture and control: transport, network and data link layers; physical: wiring. A source core and a destination core attach through network adapters to the source and destination nodes, connected via an intermediate node. Application/presentation/session/transport levels exchange messages/transactions, the network level exchanges packets/streams, and the link/data level exchanges flits over the links]
QoS Runtime Middleware API
• Interact with the Network Interface (NI) to enable NoC services through a simple middleware API
• Soft/Hard-QoS Middleware API
  – Few assembly instructions (few clock cycles)
• Low-level GT QoS routines (Hard-QoS)
  – Few cycles depending on the endpoints and NoC flit width
inline int ni_open_channel(uint32_t address, bool full_duplex);
inline int ni_close_channel(uint32_t address, bool full_duplex);
int setPriority(int PROC_ID, int MEM_ID, int level);
int resetPriority(int PROC_ID, int MEM_ID);
int resetPriorities(void);
int sendStreamQoS(byte *buffer, int length, int MEM_ID);
int recvStreamQoS(byte *buffer, int length, int MEM_ID);
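A typical calling sequence for the Soft-QoS routines might look like the sketch below. Only the prototypes of `setPriority`, `resetPriority` and `sendStreamQoS` come from the slides. Their bodies here are host-side stubs, and the return-value convention, the `byte` typedef, the table sizes, the use of level 0 as best effort, the fixed caller ID 0 and the helper `send_critical` are all assumptions made for illustration.

```c
#include <stddef.h>

typedef unsigned char byte;   /* assumption: `byte` is an 8-bit unsigned type */

#define MAX_PROCS 4           /* hypothetical system size */
#define MAX_MEMS  4
#define MAX_LEVEL 7           /* 8 priority levels, 0..7; 0 assumed best effort */

/* Host-side stand-in for the NI priority registers. */
static int priority_table[MAX_PROCS][MAX_MEMS];
static int last_stream_level = -1;

int setPriority(int PROC_ID, int MEM_ID, int level)
{
    if (PROC_ID < 0 || PROC_ID >= MAX_PROCS ||
        MEM_ID < 0 || MEM_ID >= MAX_MEMS ||
        level < 0 || level > MAX_LEVEL)
        return -1;
    priority_table[PROC_ID][MEM_ID] = level;
    return 0;
}

int resetPriority(int PROC_ID, int MEM_ID)
{
    return setPriority(PROC_ID, MEM_ID, 0);
}

/* Stub: a real implementation would push the buffer through the NI.
 * The sketch assumes the caller is processor 0. */
int sendStreamQoS(byte *buffer, int length, int MEM_ID)
{
    (void)buffer;
    if (MEM_ID < 0 || MEM_ID >= MAX_MEMS)
        return -1;
    last_stream_level = priority_table[0][MEM_ID];
    return length;
}

/* Raise the priority, do the transfer, then drop back to best effort. */
int send_critical(byte *buffer, int length, int mem_id, int level)
{
    if (setPriority(0, mem_id, level) != 0)
        return -1;
    int sent = sendStreamQoS(buffer, length, mem_id);
    resetPriority(0, mem_id);
    return sent;
}
```

Wrapping the three calls in one helper keeps the priority window as short as the transfer itself, so other traffic is only deprioritized while the critical stream is in flight.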
Parallel Programming Models on MPSoCs
• Traditional parallel programming models, now in MPSoCs
  – OpenMP [liu03][marongiu09]
  – MPI [saldaña06][psota08][joven09]
• MPSoC programming challenges
  – Hide hardware complexity & increase SW programmability
  – Provide well-known parallel programming models
  – Parallelize and map applications, tasks, data
  – Expose communication QoS to programming models
• We believe MPI-like programming suits many-core NoC-based systems well
  – Inherently distributed and scalable nature of NoC-based MPSoC embedded systems
  – Low-latency interconnects allow fast message-passing inter-process communication
  – Overhead of the cache-coherence protocols used in shared memory
  – Know-how and infrastructure available
[Figure: levels of parallelism. Large grain (task level): Task i-1, Task i and Task i+1 exchanging messages; medium grain (control level): functions F1(), F2(), F3(); fine grain (data level): a[0]=..., b[0]=..., a[1]=..., b[1]=..., a[2]=..., b[2]=...; very fine grain (multiple issue): +, x, /]
QoS-aware ocMPI Library Overview
• Lightweight on-chip MPI communication library
  – It does not rely on any OS
  – All data structures have been simplified
  – No support for virtual topologies
  – Fortran bindings and MPI I/O functions are not included yet
  – ocMPI follows the MPI 2.0 standard API prototypes to keep code portability
[Figure: code size in bytes (0 to 14000) for four configurations: basic stack; basic stack + management; basic stack + profiling; basic stack + management + profiling + advanced communication. Components: NI driver/middleware, basic ocMPI, ocMPI management, ocMPI profiling, advanced communication. Annotation: ~5KB]
Exposing QoS on the ocMPI Library
• Exposing QoS features by means of the ocMPI_Tag
  – Use of a mask on the ocMPI_Tag
• Trigger QoS services by simple annotation of critical tasks
  – Automatic inlining of QoS middleware functions in the ocMPI library: setPriority(), resetPriority(), ni_open_channel(), ni_close_channel()
  – Enable/disable QoS services at NoC level ☺
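Using a mask on the tag can be illustrated with a few bit operations. The bit layout below is purely hypothetical, since the slides do not give the actual ocMPI mask: the top 4 bits of a 32-bit tag are assumed to carry the QoS request (one guaranteed-throughput flag plus a 3-bit priority level) and the remaining 28 bits stay available to the application.

```c
#include <stdint.h>

/* Hypothetical layout of a 32-bit tag: [31] GT flag, [30:28] priority,
 * [27:0] application tag. */
#define QOS_SHIFT      28
#define QOS_MASK       0xF0000000u
#define QOS_GT_FLAG    0x8u   /* request a guaranteed-throughput channel */
#define QOS_LEVEL_MASK 0x7u   /* 8 priority levels */

/* Annotate a tag with a soft-QoS priority level (0..7). */
uint32_t tag_with_priority(uint32_t user_tag, unsigned level)
{
    return (user_tag & ~QOS_MASK) |
           ((uint32_t)(level & QOS_LEVEL_MASK) << QOS_SHIFT);
}

/* Annotate a tag with a hard-QoS (GT) request. */
uint32_t tag_with_gt(uint32_t user_tag)
{
    return (user_tag & ~QOS_MASK) | ((uint32_t)QOS_GT_FLAG << QOS_SHIFT);
}

/* Decoding helpers, as the library side might use them. */
unsigned tag_priority(uint32_t tag)
{
    return (tag >> QOS_SHIFT) & QOS_LEVEL_MASK;
}

int tag_wants_gt(uint32_t tag)
{
    return ((tag >> QOS_SHIFT) & QOS_GT_FLAG) != 0;
}

uint32_t tag_user_part(uint32_t tag)
{
    return tag & ~QOS_MASK;
}
```

With such a scheme the library can decode the tag inside its send routine and call the matching middleware function, so application code only changes in the tag argument of the critical messages.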
People in the ES group
David Castells (MSc in Microelectronics). Founder of Histeresys (SME selling FPGA-based devices), Associate Lecturer in microelectronics. Finishing his PhD (100 cores on a DE-4)
Eduard Fernandez (MSc in Microelectronics). HiPEAC internship at Recore. Finishing his PhD on off-loading message passing functions
Albert Saa (MSc in Computer Vision). Source-to-source compilation for parallel applications on MPSoCs
Former members
Jaume Joven. ARM, PostDoc at EPFL / iNocs
Marius Montón. Qemu/SystemC. GreenSoCs, now at WorldSensing
Eric Teruel. IIIAC, now CEO of Finixer
JC Chak. Now in China
Jorge Zapata. Linux guru. Neuros, OpenMoco, now at Fluendo
Miquel Izquierdo. UCI, now at Intel
Aitor Rodriguez (PhD). Starting a business
Sergi Risueño. Now at Varpe