A Holistic Approach for building MPSoCs. - Institute for …cs.ipm.ac.ir/cads2013/files/slides/drcarrabina.pdf · 2013-11-02 · A Holistic Approach for building MPSoCs. Jordi Carrabina

A Holistic Approach for building MPSoCs.

Jordi Carrabina

CAIAC, UAB, Barcelona (Catalonia, Spain) [email protected]

Thanks to David Castells, Eduard Fernandez, Albert Saa

UAB location

Barcelona

UAB. Campus Bellaterra

Overview

• Short group biopic
• Fundamental Concepts
• Platform-based design
• MPSoC & NoCs
• Tools for design and verification
• Application example

Ambient Intelligence & Accessibility

Research lines

• Content design
• VR, modeling & videogames
• MM platforms & Interactivity
• Speech technologies
• Computation & integration technologies

• Electronics are being integrated in complex environments

• Solutions have to be physically and functionally flexible

Goal: flexible systems

ES Industrial Developments

Mitsubishi Electric easyPhoto
Praesentis submarine video logger
On-Laser laser marking controller

Projects

PARMA (ITEA2 2007-2010)

SMECY (Artemis)

H4H (ITEA2 2010-2013)

MANY (ITEA2 2011-2014)

COBRA (CATRENE 2010-2013)

HARP (CATRENE 2013-2015)

Agnitio (AVANZA 2009-2012)

Documeet (FP7 2012-2014)

Fundamental Concepts

Basic Concepts: More-Moore (MPSoC) & More-than-Moore (PE)

Ambient Intelligence: cheap, interoperable, low power embedded software platforms

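[Figure: ambient intelligence device classes, from explicit computing through nomadic & private spaces to ambient sensors and actuators. Classes: “Watt” (mains), “Milliwatt” (battery and cheap consumer), “Microwatt” (ambient). Power scale: 100 W, 1 W, 100 mW, 100 µW. Performance scale: 1 Tflops, 100 Gops, 10 Gops, 10 Mops. Software: GP SW, E SW, E SW. Courtesy: H. De Man, ESSCIRC’06]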

Gates & Wires

Platform-based design

Platforms in Electronic Systems

A platform is a family of architectures satisfying a set of constraints imposed to allow the reuse of hardware and software components. However, a hardware platform is not enough. Quick, reliable, derivative design requires using a platform application programming interface (API) to extend the platform toward application software. In general, a platform is an abstraction layer that covers many possible refinements to a lower level. Platform-based design is a meet-in-the-middle approach: in the top-down design flow, designers map an instance of the upper platform to an instance of the lower, and propagate design constraints [Sangiovanni-Vincentelli, 2002].

Design Evolution (I) (Gajski)

Design Evolution (II) (General)

Model: “Golden”, Untimed, Timed, Cycle accurate
Architecture: HW, SW, OS, FW
Implementation: I/O, Memory, Power supply, Clock & reset

Design Evolution (III) (F. Katthoor)

Algorithms + Data Structures
[Figure: mapping onto an architecture (ARM, IP1, IP2, RAM, ROM) and then onto a platform architecture (RAM, ROM, MMU, custom logic, DSP, microprocessor). © imec 2002]

Platform-based design: API
http://embedded.eecs.berkeley.edu/metropolis/platform.html

Platform Design Methodologies: Platform Stacks

Application

Architecture

System Platform Stack

Silicon Implementation

Silicon Implementation Platform Stack

Architecture Platform Instance
Silicon Implementation Platform Instance
Application
Implementation
Path to industrialization

MPSoC & NoCs

SoC Complexity

• Clock Domain and GALS Model
– The “spatial” domain (tile) of the clock signal is being reduced with the increase of integration density and operating frequency
• Globally Asynchronous
– Different synchronous regions communicate through asynchronous protocols or a clock hierarchy
• Locally Synchronous
– In each synchronous region there is only one valid clock that relies on the “classical” set of correct design rules

<2000

>2000

Homo- & Heterogeneous IP Arrays

Power Management

• Reduce power consumption by adapting supply voltage and clock frequency to computational needs (at task level)
– Dynamic Frequency and Voltage Scaling
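As a rough illustration of task-level frequency and voltage scaling, the sketch below picks the slowest operating point that still meets a task deadline, using the usual first-order dynamic power model P ≈ C·V²·f. The operating-point table, the capacitance constant and every name here are assumptions for illustration; none of them come from the slides.

```c
#include <stddef.h>

/* Hypothetical operating point: supply voltage and clock frequency. */
typedef struct {
    double volts;
    double mhz;
} OperatingPoint;

/* First-order dynamic power model, P = C * V^2 * f (arbitrary units). */
static double dynamic_power(double c_eff, OperatingPoint op)
{
    return c_eff * op.volts * op.volts * op.mhz;
}

/* Return the index of the lowest-frequency point that completes `cycles`
 * within `deadline_us`, or -1 if no point is fast enough.
 * `pts` must be sorted by ascending frequency. */
static int pick_operating_point(const OperatingPoint *pts, size_t n,
                                double cycles, double deadline_us)
{
    for (size_t i = 0; i < n; i++) {
        double time_us = cycles / pts[i].mhz; /* MHz = cycles per microsecond */
        if (time_us <= deadline_us)
            return (int)i;
    }
    return -1;
}
```

Choosing the slowest feasible point is only one possible policy; it is attractive because lower frequency usually permits lower voltage, and power falls with the square of the voltage in this model.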

SoC Paradigm (Single-Tile)

• Single processor model: “classical” processor and/or HW blocks with a flexible bus
[Figure: CPU with program memory, two data memories behind arbiters, two HW accelerators, and peripherals]

MPSoC Speed Performance

• Execution time for a given application:

$$t_{function} = \frac{N_{proc}}{funct} \cdot \frac{inst}{proc} \cdot \frac{cyc}{inst} \cdot \frac{nseg}{cyc}$$

• Parallelism
• Networks
• Compilers
• Instruction Set
• Coprocessors
• Micro-Architecture
• Inst. Parallelism
• Technology
• Device

PLATFORM

The beginnings…

From scalability to NoCs

• Interconnect is becoming a major design bottleneck
– Point-to-point
– On-chip busses
  • Shared or hierarchical
– Crossbars
– Network-on-chip
• Design Space Exploration through EDA/CAD tools
• IP core reusability and efficient HW-SW interfaces
• Embedded software design
– Parallel programming models
– Runtime middleware routines

[Figure labels: Embedded software; Execution platform; HW/SW interfaces; Embedded system: software, hardware; HW/SW trade-offs]

[Figure: blocks A, B and C attached to the network through wrappers with write and read interfaces.
Block A: snd(msg1, 0) ... snd(msg2, 1)
Block B: r1=rcv(0) ... r2=rcv(1)
Routing table: A at 0,0; B at 2,2; C at 2,0
Packet along the path: [ B, 1, msg1 ] (logical destination) → [ 2, 2, 1, msg1 ] → [ 1, 2, 1, msg1 ] → [ 0, 2, 1, msg1 ] → [ 0, 1, 1, msg1 ] → [ 0, 0, 1, msg1 ] → [ 1, msg1 ]]

Process Communication (NoC)
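The packet sequence in the figure suggests that the sender's wrapper replaces the logical destination with a hop count in X and Y, and that each switch consumes one hop (X first, then Y) until the header reads 0,0 and the message is delivered. A minimal sketch of that reading follows; the type names, the payload field and the helper functions are assumptions for illustration, and only the coordinates come from the slide.

```c
enum { BLOCK_A, BLOCK_B, BLOCK_C, NUM_BLOCKS };

/* Remaining hops in X and Y, counted from the sender. */
typedef struct {
    int x;
    int y;
} Route;

typedef struct {
    Route route;   /* physical route filled in by the wrapper */
    int channel;   /* receiver-side channel, e.g. the 1 in snd(msg, 1) */
    int payload;   /* stand-in for the message body */
} Packet;

/* Routing table of the sender (block A at 0,0), as read from the figure. */
static const Route routing_table[NUM_BLOCKS] = {
    [BLOCK_A] = {0, 0},
    [BLOCK_B] = {2, 2},
    [BLOCK_C] = {2, 0},
};

/* Wrapper step: translate the logical destination into a route. */
static Packet wrap(int logical_dest, int channel, int payload)
{
    Packet p;
    p.route = routing_table[logical_dest];
    p.channel = channel;
    p.payload = payload;
    return p;
}

/* One switch traversal: consume one hop, X first and then Y.
 * Returns 1 if the packet moved, 0 if it has already arrived. */
static int route_step(Packet *p)
{
    if (p->route.x > 0) { p->route.x--; return 1; }
    if (p->route.y > 0) { p->route.y--; return 1; }
    return 0;
}

/* Forward the packet until the route is exhausted; returns the hop count. */
static int hops_to_deliver(Packet *p)
{
    int hops = 0;
    while (route_step(p))
        hops++;
    return hops;
}
```

With this scheme a switch never needs a global address map: it only inspects and decrements the leading header fields, which keeps the switch logic small.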

Preliminary NoC Concepts

• Network-on-chip (NoC) View
– Links
– Switch
– Network Interface (NI)
– System components (IP cores)
• NoC design space is huge
– Topology
– Routing algorithm
– Switching techniques
– Buffering/Virtual Channels
  • Location/Depth/Flit width
– Channel arbitration
– Flow control
[Figure: NoC of six switches connecting DSP, MPEG, DRAM, Accel, CPU, DMA, I/O and Coproc cores through NIs] [ogras05] [castells06]

[Figure: bus-based system with address and data paths, an arbiter, an address decoder, width-match logic and two clock domains (Clock 1, Clock 2). Bus masters: Processor (32-bit), Ethernet (32-bit). Other components: Interrupt Controller, Timer (16-bit), UART (8-bit), DDR2 (64-bit), PCI (64-bit), Memory (32-bit)]

FPGA & GALS

© 2007 Altera Corporation

Tools for MPSoC design and verification: Holistic approach

Current FPGAs can embed several RISC processors, creating a many-soft-core (>100 processors)

Some embedded systems are
• Application specific
• Produced in low volumes
• Required to meet some constraints (energy, performance)

This can economically justify the use of many-soft-cores tailored to specific applications (that can include specific instructions & co-processors)

Many Soft-Core Systems

EDA Tools for MPSoCs

• NoCMaker EDA tool (http://sourceforge.net/projects/nocmaker/)
– Efficient RTL code generation of a NoC for fast prototyping
– Easy capture and quick tuning optimization of NoCs by a GUI
– Easy simulation, verification and validation of HW-SW components on a NoC-based system
– Automatic generation of synthetic traffic
  • Identification of bottlenecks/congestion (performance, workload, balancing)
– Early area, performance and power pre-synthesis estimations
– Visual interactive simulator window

EDA Tools: NocMaker

• Cross-platform open-source EDA tool for design space exploration of NoC-based systems
• NoCMaker is based on JHDL [bellows1998]

Modeling NoC using NocMaker

Simulate Traffic Patterns & Applications

• Traffic patterns
– Custom, finite or infinite
  • Nodes, IR and message length
– Pre-defined traffic patterns
  • Universal, bit reversal, perfect shuffle, butterfly, matrix transpose, complement
  • Token pass, barrier, …
• Parallel applications (ocMPI)
– MPI-based Java stack to run message passing apps on NoCMaker

Bellat HW-SW --

Simulation and Validation

• Simulation of NoC-based MPSoCs
• Validation and verification
– Circuit browser and waveform viewer
– Packet sequence analyzer
• Detect application deadlocks/livelocks when message passing parallel applications run on top of the NoC-based MPSoC

Off-loading MPI

Eager Protocol: fast, but requires a buffer on the receiver
Rendez-vous Protocol: requires synchronization, adding overhead

Off-loading MPI

• The overhead of synchronization in Rendez-vous is caused by additional instruction execution and the processing of short simple synchronization messages
• Solution: implement a (small) independent NoC for synchronization and a NI that contains synchronization primitives

Applications: Sharing FPUs

• FPUs have independent, pipelined units for each FP operation, but the pipelines are used inefficiently (no multiple operations in flight)
• Solution: fill the pipelines with FP operations from several tiles

[Chart: Clock Cycles Overhead (0.0% to 6.0%) vs. Processors (0 to 16); series: Mandelbrot, Matrix Multiplication]
[Chart: Time Overhead (-1.00% to 6.00%) vs. Processors (2 to 16); series: Mandelbrot, Matrix Multiplication]
[Chart: resource usage in KLUTs+KFFs (0 to 140) for FPU vs. Shared FPU, split into FPU, NOC, NICs, CPUs, Others]

IS extensions to reduce latency

[Chart: categories NocMaker tokenpass, NocMaker reqmas, tokenpass, reqmas; series Avalon NA and CI NA (axis 0 to 700) and Speedup (axis 0.00 to 3.50); data labels 1.21, 1.18, 2.91, 2.43]
• Embed communication primitives at the ISA level
• Implemented as NIOS custom instructions

Performance Analysis Support

• Source-to-source compiler to instrument code with custom instructions (CIs)
• CIs inject traces into a dedicated network
• Tracing unit dumps traces to external memory
• Benefits: scalable solution with minimal overhead

Build full-system virtual platforms consisting of an ISS plus HDL models (NoC and NIC) to introduce time accuracy for computation while reducing verification time

Use transparent instrumentation
• Avoiding instrumentation overhead
• Avoiding the memory / bandwidth otherwise required

[Figure: trace logging by transparent instrumentation. The calling function executes `__asm call func`; the called function `double func(int a, int b)` runs its original body and ends with `__asm ret`. On the call, the trace logging function finds the function name from the function address, calls Log(Enter) and pushes the function name onto a stack. On the return, it pops the function name from the stack and calls Log(Leave). Trace memory then holds “time stamp vt0, enter func” and “time stamp vt1, leave func” along the virtual time axis.]

Performance Analysis using Virtual Platforms
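The enter/leave flow in the trace logging figure can be sketched in plain C as follows. This is only an illustration under stated assumptions: the buffer sizes, the record layout and the `virtual_time` variable are invented here, and the function name is passed in directly rather than resolved from the function address as the figure describes.

```c
#include <stddef.h>

#define TRACE_CAPACITY 64   /* assumed size of the trace memory */
#define STACK_DEPTH    16   /* assumed maximum call nesting */

typedef enum { TRACE_ENTER, TRACE_LEAVE } TraceKind;

typedef struct {
    unsigned long vt;    /* virtual time stamp */
    TraceKind kind;
    const char *name;    /* function name */
} TraceRecord;

static TraceRecord trace_memory[TRACE_CAPACITY];
static size_t trace_count;
static const char *name_stack[STACK_DEPTH];
static size_t stack_top;
static unsigned long virtual_time;  /* stand-in for the platform's virtual clock */

/* Append one record to trace memory; silently drops records when full. */
static void log_event(TraceKind kind, const char *name)
{
    if (trace_count < TRACE_CAPACITY) {
        trace_memory[trace_count].vt = virtual_time;
        trace_memory[trace_count].kind = kind;
        trace_memory[trace_count].name = name;
        trace_count++;
    }
}

/* Hooked on a call: Log(Enter), then push the function name. */
static void trace_enter(const char *name)
{
    log_event(TRACE_ENTER, name);
    if (stack_top < STACK_DEPTH)
        name_stack[stack_top++] = name;
}

/* Hooked on a return: pop the function name, then Log(Leave). */
static void trace_leave(void)
{
    if (stack_top > 0)
        log_event(TRACE_LEAVE, name_stack[--stack_top]);
}
```

In a virtual platform the two hooks run inside the simulator rather than on the target, which is why the slides can describe the instrumentation as free of target-side overhead.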

QEmu/SystemC

• Mix the SystemC HDL with QEmu for ESL design
– Various possible levels (TLM, RTL)
– Either SystemC as a device or QEmu as a SystemC module

Application example using many soft-cores

Application

• Multi-Soft-Core based Laser Marking Controller
• On-Laser is a Spanish SME that creates application-specific laser-based systems

Introduction

• Off-the-shelf components
– Expensive & less flexible
  • Adaptation to multiple machines
  • High productivity as a selling argument
• Hard-RT requirements (laser burns)
• Goal: create a new controller with high productivity & flexibility, meeting RT constraints, at reduced cost, with simultaneous control of 4 heads (high laser usage)
– Incremental approach for fast development and systematic testing, avoiding custom HW development

Laser Marking Process

• Components for laser marking in 1D
– Laser source
– Galvanometer mirror
– Lens
• Going to 2D
– Additional galvanometer mirror
[Figure: laser light source, two galvanometers, lens]
• Rotating system (e.g. for cork stoppers)
– Additional servo / encoder pair
[Figure: laser light source, two galvanometers, motor / encoder, lens]
• 4 heads with 4 laser light sources
[Figure: four heads, each with a laser light source, two galvanometers, a motor / encoder and a lens]
• Multiplexing a single laser light source (patented by On-Laser)

Initial HW / SW decisions

• OTS low-cost FPGA board desired (Altera)
– One NIOS-II will be used to control each marking head
– 4 “independent” soft-cores will be used
– Synchronization required when multiplexing the laser light source
• Shared memory architecture: easy sharing of program and data
• Laser pulses are on the order of a few µs and have hard real-time requirements => HW mandatory
• Scan head galvanometers are controlled through proprietary protocols (XL2-100, SL2) that continuously send their coordinates through a serial connection => HW mandatory
• Motor pulses are relatively slow (a few ms or below) => SW ?
• Encoder readings must be quickly & accurately integrated to correct galvanometer position => HW ?

Laser Pulse Control

• Pulse requirements
– A pulse is generated for every image pixel
– Pulse duty cycle determines greyscale level
– Pixels are grouped in rasters that must be burned sequentially, avoiding time gaps
• HW design
– Custom instruction that receives “active time” and “duration” operands
– Non-blocking operation to avoid temporal gaps
– Raster control in SW

Laser Light

Source
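To make the split between the pulse hardware and the raster software concrete, the sketch below converts an 8-bit grey level into an “active time” for a fixed pixel “duration” and issues one pulse per pixel. The linear grey-to-duty mapping, the tick units and the `laser_pulse_ci` stub are all assumptions; on the real controller the stub would be the non-blocking Nios II custom instruction, whose exact interface the slides do not show.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the laser pulse custom instruction.
 * It only records its operands so the sketch can run on a host. */
static uint32_t last_active_ticks;
static uint32_t last_duration_ticks;
static size_t pulses_issued;

static void laser_pulse_ci(uint32_t active_ticks, uint32_t duration_ticks)
{
    last_active_ticks = active_ticks;
    last_duration_ticks = duration_ticks;
    pulses_issued++;
}

/* Assumed linear mapping: duty cycle = grey / 255. */
static uint32_t active_ticks_for_grey(uint8_t grey, uint32_t duration_ticks)
{
    return (uint32_t)(((uint64_t)grey * duration_ticks) / 255u);
}

/* Raster control in software: one pulse per pixel, issued back to back. */
static void burn_raster(const uint8_t *pixels, size_t n, uint32_t duration_ticks)
{
    for (size_t i = 0; i < n; i++)
        laser_pulse_ci(active_ticks_for_grey(pixels[i], duration_ticks),
                       duration_ticks);
}
```

Because the real instruction is described as non-blocking, the loop can compute the next pixel while the current pulse is still being generated, which is how time gaps inside a raster are avoided.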

Scan Head Control

• Positioning requirements
– Use of industrial protocols (XL2-100, SL2)
– Continuously updating the position
• HW design
– Custom instruction that receives X and Y positions as operands
– Non-blocking operation

Validation of 2D marking system

• Early validation of 2D marking
• First productivity measurements
• First quality assessments

Extending to rotary systems

• Introduce a virtual coordinate system
• Integrate rotating movement

Phy X = X Origin Offset + Virtual X – Encoder Advance
Phy Y = Y Origin Offset + Virtual Y

[Figure: physical axes (Phy X, Phy Y) and virtual axes (Virtual X, Virtual Y), shifted by the encoder advance]
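The two coordinate equations translate directly into code. In the sketch below the struct names and the use of plain integers for positions are assumptions; the arithmetic is exactly the pair of formulas on the slide.

```c
/* Per-head calibration offsets (names are illustrative). */
typedef struct {
    int x_origin_offset;
    int y_origin_offset;
} HeadCalibration;

typedef struct {
    int x;
    int y;
} Point;

/* Phy X = X Origin Offset + Virtual X - Encoder Advance
 * Phy Y = Y Origin Offset + Virtual Y */
static Point virtual_to_physical(HeadCalibration cal, Point virt,
                                 int encoder_advance)
{
    Point phy;
    phy.x = cal.x_origin_offset + virt.x - encoder_advance;
    phy.y = cal.y_origin_offset + virt.y;
    return phy;
}
```

For example, with offsets (100, 50), a virtual point (30, 20) and an encoder advance of 10, the physical target is (120, 70): the scan head trails the rotation in X while Y is unaffected.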

Motor Control

• Requirements
– Motor pulse generation is slow
– But encoder feedback should be immediately integrated to correct scan head position
• HW design
– PIO for motor control
– Avalon bus slave device for encoder reading. Can be programmed to increment / decrement by an arbitrary number of units
[Figure label: AVALON]

Tiled Design

• Independent tiles are almost identical
• They share a common bus to access the SDRAM controller and on-chip memory

Design flow

• Quartus II
• SOPC builder
– Reset control not easy -> Verilog manipulation
• Currently migrating to QSys

Boot loading process

• Problem
– Altera tools allow booting from EPCS (flash) or from RAM (downloaded from host when debugging)
– Only the processor that has the EPCS controller can boot from EPCS
– Offsets for (CPU1, CPU2, CPU3) in EPCS are unknown
• Solution
– Boot CPU0 with the standard EPCS bootloader
– Develop a custom bootloader, executed on CPU0, that
  • Resets the slave CPUs
  • Transfers data from EPCS to RAM
  • Boots the slaves from RAM
– The slaves' code is only a bootloader that receives the main function pointer from CPU0

Host Communication

• Host SW
– Download of images to mark
– Control activation of the heads
• Communication through JTAG UART with CPU0
– CPU0 has to forward messages to other CPUs if necessary

Programming Model

• Message passing to coordinate operation
– Matrix of software mailboxes (single message)
– Implemented in shared memory
– IO operations to bypass cache
• Shared memory allows reuse of program and data
– Slave bootloaders transfer execution to CPU0 main code

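#include <io.h>  /* Nios II HAL: IORD / IOWR cache-bypassing accessors */

/* UnidirectionalPipe (defined elsewhere) is a single-message mailbox with an
   `available` flag and a `data` word. busyLoop() is a back-off delay. */

/* Blocks until the pipe is empty, then posts v. */
void sendPipe(UnidirectionalPipe* pipe, int v)
{
    int i = 5;

    // ensure no data in the pipe; back off a little longer on each retry
    while (IORD(&(pipe->available), 0)) {
        busyLoop(i++);
    }
    IOWR(&(pipe->data), 0, v);
    IOWR(&(pipe->available), 0, 1);  // raise the flag only after the data is written
}

/* Blocks until a message is available, then returns it. */
int recvPipe(UnidirectionalPipe* pipe)
{
    int data, i = 5;

    while (!IORD(&(pipe->available), 0)) {
        busyLoop(i++);
    }
    // take the data and free the flag
    data = IORD(&(pipe->data), 0);
    IOWR(&(pipe->available), 0, 0);
    return data;
}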
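The “matrix of software mailboxes” can be pictured as one single-message slot per (sender, receiver) pair. The host-side sketch below is an assumption-laden illustration: the `Mailbox` layout, the `NUM_CPUS` constant and the non-blocking `try` functions are invented here, and `volatile` accesses stand in for the cache-bypassing IORD/IOWR operations used on the real system.

```c
#define NUM_CPUS 4  /* one soft-core per marking head, as in the slides */

/* Single-message mailbox (layout assumed for illustration). */
typedef struct {
    volatile int available;  /* 1 while a message is waiting */
    volatile int data;
} Mailbox;

/* mailboxes[src][dst]: one private slot per ordered pair of CPUs. */
static Mailbox mailboxes[NUM_CPUS][NUM_CPUS];

/* Post a message; returns 1 on success, 0 if the slot is still full. */
static int mailbox_try_send(int src, int dst, int value)
{
    Mailbox *m = &mailboxes[src][dst];
    if (m->available)
        return 0;
    m->data = value;
    m->available = 1;  /* raise the flag only after the data is written */
    return 1;
}

/* Fetch a message; returns 1 and fills *value, or 0 if the slot is empty. */
static int mailbox_try_recv(int src, int dst, int *value)
{
    Mailbox *m = &mailboxes[src][dst];
    if (!m->available)
        return 0;
    *value = m->data;
    m->available = 0;  /* free the slot for the next message */
    return 1;
}
```

Giving every ordered pair its own slot means each mailbox has exactly one writer and one reader, so no lock is needed beyond the flag itself. On real multi-core hardware the ordering of the data and flag writes still has to be guaranteed by the memory system, which this host sketch does not model.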

Performance analysis utilities

• HW timing tested by
– Simulation
– SignalTap
– External analyzer
• SW timing tested by trace generation
– Automatic compiler instrumentation
– Generation of OTF traces to be visualized in HPC tools

Synthesis Results

• Synthesis for Cyclone IV 22KLE
• Terasic DE0_Nano

Logic Elements (LEs): 17058 / 22320 (76 %)
Memory bits: 174504 / 608256 (29 %)
fmax: 62.98 MHz

Verification & Optimization

• Fundamental part of the design process (2/3)
• Unit testing
• Time analysis
• Real system-level productivity
• Quality observed in real execution

Results

Conclusions

• Technological evolution pushes for many-core solutions in the embedded application-specific domain
• Multi-soft-core system proven highly flexible, allowing:
– Fast coding, early assessment of whether requirements are met, incremental development, minimal HW design (just where needed)
– Reuse of tools from HPC (MPI on many-soft-cores)
• From an industrial perspective
– Available OTS components
– Cost reduction (~2 orders of magnitude)
– High efficiency (up to 97% of laser light usage)

Thanks for your attention

HIP3ES 2014. High Performance Energy Efficient Embedded Systems
– http://www.eurekamany.org/hip3es2014.html
– Vienna, Jan 21st 2014. Co-located with HiPEAC 2014
– http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=33452&copyownerid=8130

QoS Experiments

Motivations & Key Challenges

• Many-soft-cores share a common resource
• Problem: collisions & interference
• Two typical solutions
– Overdimensioning
– QoS
• Test how to make QoS visible to the developer in a MP programming model on a shared memory architecture

QoS Support at NoC Level

• Traditional QoS techniques also apply in NoCs, but…
• Tight area and power constraints
☺ Control on the software stack
☺ Reconfiguration of taped-out NoC-based chips using software facilities
• Related work
– QNoC [bolotin04]
– Combining BE and GT [andreasson04][marescaux07]
– Design to support multiple use cases [murali06][hansson07]
– QoS management at task level [hansson09][carara09]
• Handle QoS communication services on a message-passing parallel programming model

Architectural Support for QoS Services

• Proposed QoS support extending best-effort xpipes NoC
– Soft-QoS: configurable N levels of priority (up to 8 levels)
– Hard-QoS: Guaranteed Throughput (GT) using emulation of circuit switching
[Figure: packet header fields: OTHER PACKET FIELDS, SENDER, QOS, ROUTE]
• CPU triggers QoS at the Network Interface (NI) level
1. CPU writes into memory-mapped registers in the NI
2. CPU performs the actual transaction(s)
3. NI injects packets with suitable tag bits into the NoC

Proposed Vertical Approach

• Runtime QoS support through HW and middleware routines
– HW QoS features exposed through software middleware routines
– Vertical approach: QoS on top of the parallel programming model (MPI)
[Figure: layered NoC communication stack. Software: application, system. Architecture and control: transport, network, data link. Physical: wiring. A source core and its network adapter (source node) reach the destination core and its network adapter (destination node) through an intermediate node. Application/Presentation/Session/Transport exchange messages/transactions; Network exchanges packets/streams; Link/Data exchanges flits over links.]

QoS Runtime Middleware API

• Interact with the Network Interface (NI) to enable NoC services through a simple middleware API
• Soft/Hard-QoS Middleware API
– Few assembly instructions (few clock cycles)
• Low-level GT QoS routines (Hard-QoS)
– Few cycles depending on the endpoints and NoC flit width

inline int ni_open_channel(uint32_t address, bool full_duplex);

inline int ni_close_channel(uint32_t address, bool full_duplex);

int setPriority(int PROC_ID, int MEM_ID, int level);

int resetPriority(int PROC_ID, int MEM_ID);

int resetPriorities(void);

int sendStreamQoS(byte *buffer, int length, int MEM_ID);

int recvStreamQoS(byte *buffer, int length, int MEM_ID);
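A typical use of the soft-QoS routines would be to raise the priority around a critical transfer and restore best-effort service afterwards. The sketch below shows that pattern on a host. The three middleware functions are replaced by hypothetical stubs that only record state; the return conventions (0 on success, -1 on error, byte count for the stream call) and the `send_critical` helper are assumptions, not the library's documented behaviour.

```c
typedef unsigned char byte;

#define MAX_PRIORITY_LEVELS 8   /* the slides mention up to 8 levels */

/* Hypothetical host-side stubs standing in for the NI middleware. */
static int current_level = -1;  /* -1 means best effort */
static int bytes_sent_at_level[MAX_PRIORITY_LEVELS];

int setPriority(int PROC_ID, int MEM_ID, int level)
{
    (void)PROC_ID; (void)MEM_ID;
    if (level < 0 || level >= MAX_PRIORITY_LEVELS)
        return -1;
    current_level = level;
    return 0;
}

int resetPriority(int PROC_ID, int MEM_ID)
{
    (void)PROC_ID; (void)MEM_ID;
    current_level = -1;
    return 0;
}

int sendStreamQoS(byte *buffer, int length, int MEM_ID)
{
    (void)buffer; (void)MEM_ID;
    if (current_level >= 0)
        bytes_sent_at_level[current_level] += length;
    return length;
}

/* Send one buffer at the requested priority, then fall back to best effort.
 * Returns the number of bytes sent, or -1 if the level is rejected. */
int send_critical(byte *buffer, int length, int proc, int mem, int level)
{
    if (setPriority(proc, mem, level) != 0)
        return -1;
    int sent = sendStreamQoS(buffer, length, mem);
    resetPriority(proc, mem);
    return sent;
}
```

Keeping the set and reset calls in one helper makes it harder to leave a channel at elevated priority by accident, which matters when several cores compete for the same memory.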


Parallel Programming Models on MPSoCs

• Traditional parallel programming models, now in MPSoCs
– OpenMP [liu03][marongiu09]
– MPI [saldaña06][psota08][joven09]
• MPSoC programming challenges
– Hide hardware complexity & increase SW programmability
– Provide well-known parallel programming models
– Parallelize and map applications, tasks, data
– Expose communication QoS to programming models
• We believe MPI-like programming suits many-core NoC-based systems well
– Inherently distributed and scalable nature of NoC-based MPSoC embedded systems
– Low-latency interconnects allow fast message-passing inter-process communication
– Overhead of cache-coherent protocols used in shared memory
– Know-how and infrastructure available
[Figure: levels of parallelism. Large grain (task level): Task i-1, Task i, Task i+1 exchanging messages. Medium grain (control level): F1(), F2(), F3(). Fine grain (data level): a[0]=..., b[0]=..., a[1]=..., b[1]=..., a[2]=..., b[2]=... Very fine grain (multiple issue): +, x, /]

QoS-aware ocMPI Library Overview

• Lightweight on-chip MPI communication library
– It does not rely on any OS
– All data structures have been simplified
– No support for virtual topologies
– No Fortran bindings or MPI I/O functions have been included yet
– ocMPI follows the MPI 2.0 standard API prototypes to keep code portability
[Chart: code size (in bytes, 0 to 14000) for four configurations: Basic stack; Basic stack + Management; Basic stack + Profiling; Basic stack + Management + Profiling + Adv. communication. Components: NI driver/Middleware, Basic ocMPI, ocMPI Management, ocMPI Profiling, Adv. Communication. Annotation: ~5KB]

Exposing QoS on the ocMPI Library

• Exposing QoS features by means of the ocMPI_Tag
– Use of a mask on the ocMPI_Tag
• Trigger QoS services by simple annotation of critical tasks
– Automatic inlining of QoS middleware functions on the ocMPI library
  setPriority(), resetPriority()
  ni_open_channel(), ni_close_channel()
– Enable/disable QoS services at NoC level ☺
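One way to carry QoS requests in a message tag is to reserve a few high bits and mask them off before normal tag matching. The slides do not give the actual bit layout of ocMPI_Tag, so everything below is hypothetical: the shift, the 3-bit priority field (enough for 8 levels) and the single guaranteed-throughput flag are chosen only to show the masking idea.

```c
#include <stdint.h>

/* Hypothetical layout: bits 24..26 hold the priority, bit 27 requests GT. */
#define QOS_SHIFT      24
#define QOS_PRIO_MASK  (0x7u << QOS_SHIFT)
#define QOS_GT_FLAG    (0x8u << QOS_SHIFT)
#define QOS_MASK       (QOS_PRIO_MASK | QOS_GT_FLAG)

/* Annotate a tag with a soft-QoS priority level (0..7). */
static uint32_t tag_with_priority(uint32_t tag, unsigned level)
{
    return (tag & ~QOS_PRIO_MASK) | ((uint32_t)(level & 0x7u) << QOS_SHIFT);
}

/* Annotate a tag to request a guaranteed-throughput channel. */
static uint32_t tag_with_gt(uint32_t tag)
{
    return tag | QOS_GT_FLAG;
}

/* Library side: recover the QoS request and the application's own tag. */
static unsigned tag_priority(uint32_t tag)
{
    return (tag & QOS_PRIO_MASK) >> QOS_SHIFT;
}

static int tag_is_gt(uint32_t tag)
{
    return (tag & QOS_GT_FLAG) != 0;
}

static uint32_t tag_user_part(uint32_t tag)
{
    return tag & ~QOS_MASK;
}
```

Under this scheme the library would inspect the masked bits inside a send call and invoke setPriority() or ni_open_channel() itself, so application code only changes in the tag it passes.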

People in the ES group

David Castells (MSc in Microelectronics). Founder of Histeresys (SME selling FPGA-based devices), Associate Lecturer in microelectronics. Finishing his PhD (100 cores on a DE-4)
Eduard Fernandez (MSc in Microelectronics). HiPEAC internship at Recore. Finishing his PhD on off-loading message passing functions
Albert Saa (MSc in Computer Vision). Source-to-source compilation for parallel applications on MPSoCs

Former members
Jaume Joven. ARM, PostDoc at EPFL / iNocs
Marius Montón. Qemu/SystemC. GreenSoCs, now at WorldSensing
Eric Teruel. IIIAC. Now CEO of Finixer
JC Chak. Now in China
Jorge Zapata. Linux guru. Neuros, OpenMoco, now at Fluendo
Miquel Izquierdo. UCI, now at Intel
Aitor Rodriguez (PhD). Starting a business
Sergi Risueño. Now at Varpe.
