76
1 Programming High Performance Embedded Systems: Tackling the Performance Portability Problem Alastair Reid Principal Engineer, R&D ARM Ltd

Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

  • Upload
    palmer

  • View
    45

  • Download
    0

Embed Size (px)

DESCRIPTION

Programming High Performance Embedded Systems: Tackling the Performance Portability Problem. Alastair Reid Principal Engineer, R&D ARM Ltd. Programming HP Embedded Systems. High-Performance Energy-Efficient Hardware Example: Ardbeg processor cluster (ARM R&D) - PowerPoint PPT Presentation

Citation preview

Page 1: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

1

Programming High Performance Embedded Systems:

Tackling the Performance Portability Problem

Alastair Reid

Principal Engineer, R&D

ARM Ltd

Page 2: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

2

Programming HP Embedded Systems

High-Performance Energy-Efficient Hardware Example: Ardbeg processor cluster (ARM R&D)

Portable System-level programming Example: SoC-C language extensions (ARM R&D)

Portable Kernel-level programming Example: C+Builtins

Example: Data Parallel Language

Merging System/Kernel-level programming

Page 3: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

3 3

Mobile Consumer Electronics TrendsMobile Application Requirements Still Growing Rapidly Still cameras: 2Mpixel 10 Mpixel Video cameras: VGA HD 1080p … Video players: MPEG-2 H.264 2D Graphics: QVGA HVGA VGA FWVGA … 3D Gaming: > 30Mtriangle/s, antialiasing, … Bandwidth: HSDPA (14.4Mbps) WiMax (70Mbps) LTE (326Mbps)

Feature Convergence Phone + graphics + UI + games + still camera + video camera + music + WiFi + Bluetooth + 3.5G + 3.9G + WiMax + GPS + …

Page 4: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

5 5

Mobile SDR Design Challenges

1

10

100

1000

0.1 1 10 100

Power (Watts)

Pe

ak

Pe

rfo

rma

nc

e (

Go

ps

)

Better

Pow

er Efficiency

10 Mops/m

W

100 Mops/m

W

1 Mops/m

W

5

GeneralPurpose

ProcessorsEmbeddedDSPs

Mobile SDRRequirements

Pentium MTI C6x

IBM CellHigh-end

DSPs

SDR Design Objectives for 3G and WiFi

Throughput requirements 40+Gops peak throughput

Power budget 100mW~500mW peak power

SDR Design Objectives for 3G and WiFi

Throughput requirements 40+Gops peak throughput

Power budget 100mW~500mW peak power

Slide adapted from M. Woh’s ‘From Scotch to SODA’, MICRO-41, 2008

Page 5: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

8 8

Energy Efficient Systems are “Lumpy”Drop Frequency 10x Desktop: 2-4GHz Mobile: 200-400MHz

Increase Parallelism 100x Desktop: 1-2 cores Mobile: 32-way SIMD Instruction Set, 4-8 cores

Match Processor Type to Task Desktop: homogeneous, general purpose Mobile: heterogeneous, specialised

Keep Memory Local Desktop: coherent, shared memory Mobile: processor-memory clusters linked by DMA

Page 6: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

10 10

512-bitSIMDReg.File

512-bitSIMD Mult

SIMDShuffle

Net-work

Scalar ALU+Mult

ScalarRF+ACC

L1Data

Memory

AGURF

AGU

1. wide SIMD

Pred.RF

SIMD+ScalarTransf

Unit

Ardbeg PE

3. Memory

SIMDPred.ALU

Scalarwdata

1024-bitSIMD

ACC RF

SIMDwdata

512-bitSIMD ALUwith

shuffle

EX

EX

INTERCONNECTS

INTERCONNECTS

L2Memory

2. Scalar & AGUL1ProgramMemory

Controller

EX

EX

AGU

AGU

WB

WB

WB

WB

64- b

it A

MB

A 3

AX

I In

terc

on

ne

ct

ControlProcessor

Ardbeg System

FECAccelerator

L1Mem

ExecutionUnit

PE

L1Mem

ExecutionUnit

PE

DMAC

Peripherals

L1Mem

L2Mem

512

-bit

Bu

s

Ardbeg SDR Processor

Application Specific HardwareApplication Specific Hardware

2-level memory hierarchy2-level memory hierarchy

8,16,32 bit fixed point support512-bit SIMD

8,16,32 bit fixed point support512-bit SIMD

Sparse Connected VLIWSparse Connected VLIW

Slide adapted from M. Woh’s ‘From Scotch to SODA’, MICRO-41, 2008

Page 7: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

11 11

W-CDMA 2Mbps

DVB-H

DVB-T

802.11a

W-CDMA data

W-CDMA voice

802.11a 180nm 802.11a

W-CDMA 2Mbps180nm W-CDMA 2Mbps

802.11a

W-CDMA 2Mbps

W-CDMA data

W-CDMA voice

W-CDMA data

802.11a

W-CDMA 2Mbps

0.01

0.1

1

10

100

0.01 0.1 1 10 100 1000

Power (Watts)

Ac

hie

ve

d T

hro

ug

hp

ut

(Mb

ps

)

Ardbeg

SODA

ASIC

Sandblaster

TigerSHARC

7 Pentium M

Summary of Ardbeg SDR Processor

• Ardbeg is lower power at same throughput• We are getting closer to ASICs

Slide adapted from M. Woh’s ‘From Scotch to SODA’, MICRO-41, 2008

Page 8: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

12 12

How do we program AMP systems?

C doesn’t provide language features to support Multiple processors (or multi-ISA systems)

Distributed memory

Multiple threads

Page 9: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

13 13

Use Indirection (Strawman #1)

Add a layer of indirection Operating System

Layer of middleware

Device drivers

Hardware support

All impose a cost in Power/Performance/Area

Page 10: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

14 14

Raise Pain Threshold (Strawman #2)

Write efficient code at very low level of abstraction

Problems Hard, slow and expensive to write, test, debug and maintain

Design intent drowns in sea of low level detail

Not portable across different architectures

Expensive to try different points in design space

Page 11: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

15 15

Our Response

Extend C Support Asymmetric Multiprocessors

SoC-C language raises level of abstraction

… but take care not to hide expensive operations

Page 12: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

16 16

SoC-C Overview

Pocket-Sized Supercomputers Energy efficient hardware is “lumpy” … and unsupported by C … but supported by SoC-C

SoC-C Extensions by Example Pipeline Parallelism Code Placement Data Placement

SoC-C Conclusion

Page 13: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

17 17

3 steps in mapping an application

1. Decide how to parallelize

2. Choose processors for each pipeline stage

3. Resolve distributed memory issues

Page 14: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

18 18

A Simple Programint x[100];

int y[100];

int z[100];

while (1) {

get(x);

foo(y,x);

bar(z,y);

baz(z);

put(z);

}

Page 15: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

19 19

Simplified System Architecture

Distributed Memories

Control Processor

SIMD Instruction SetData Engines

Accelerators

Artist’s impression

Page 16: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

20 20

Step 1: Decide how to parallelizeint x[100];

int y[100];

int z[100];

while (1) {

get(x);

foo(y,x);

bar(z,y);

baz(z);

put(z);

}

50% of work

50% of work

Page 17: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

21 21

Step 1: Decide how to parallelize int x[100];

int y[100];

int z[100];

PIPELINE {

while (1) {

get(x);

foo(y,x);

FIFO(y);

bar(z,y);

baz(z);

put(z);

}

}

PIPELINE indicates region to parallelize

FIFO indicates boundaries between pipeline stages

Page 18: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

22 22

SoC-C Feature #1: Pipeline Parallelism

Annotations express coarse-grained pipeline parallelism

PIPELINE indicates scope of parallelism

FIFO indicates boundaries between pipeline stages

Compiler splits into threads communicating through FIFOs

Page 19: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

23 23

Step 2: Choose Processors int x[100];

int y[100];

int z[100];

PIPELINE {

while (1) {

get(x);

foo(y,x);

FIFO(y);

bar(z,y);

baz(z);

put(z);

}

}

Page 20: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

24 24

Step 2: Choose Processors int x[100];

int y[100];

int z[100];

PIPELINE {

while (1) {

get(x);

foo(y,x) @ P0;

FIFO(y);

bar(z,y) @ P1;

baz(z) @ P1;

put(z);

}

}

@ P indicates processor to execute function

Page 21: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

25 25

SoC-C Feature #2: RPC Annotations

Annotations express where code is to execute Behaves like Synchronous Remote Procedure Call

Does not change meaning of program

Bulk data is not implicitly copied to processor’s local memory

Page 22: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

26 26

Step 3: Resolve Memory Issues int x[100];

int y[100];

int z[100];

PIPELINE {

while (1) {

get(x);

foo(y,x) @ P0;

FIFO(y);

bar(z,y) @ P1;

baz(z) @ P1;

put(z);

}

}

P0 uses x x must be in M0

P1 uses z z must be in M1

P0 uses y y must be in M0

P1 uses y y must be in M1

Conflict?!

Page 23: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

27 27

Hardware Cache Coherency

P0

$0

P1

$1

write x

read x

write x

invalidate x

copy x

invalidate x

Page 24: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

28 28

Step 3: Resolve Memory Issues int x[100];

int y[100];

int z[100];

PIPELINE {

while (1) {

get(x);

foo(y,x) @ P0;

FIFO(y);

bar(z,y) @ P1;

baz(z) @ P1;

put(z);

}

}

Two versions: y@M0, y@M1

write y@M0 y@M1 is invalid

reads y@M1 Coherence error

Page 25: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

29 29

Step 3: Resolve Memory Issues int x[100];

int y[100];

int z[100];

PIPELINE {

while (1) {

get(x);

foo(y,x) @ P0;

SYNC(x) @ DMA;

FIFO(y);

bar(z,y) @ P1;

baz(z) @ P1;

put(z);

}

}

SYNC(x) @ P copies data from one version of x to another using processor P

read y@M1

y@M1 and y@M0 are valid

Page 26: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

30 30

SoC-C Feature #3: Compile Time Coherency

Variables can have multiple coherent versions Compiler uses memory topology to determine which version

is being accessed

Compiler applies cache coherency protocol Writing to a version makes it valid and other versions invalid

Dataflow analysis propagates validity

Reading from an invalid version is an error

SYNC(x) copies from valid version to invalid version

Page 27: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

31

Compiling SoC-C

See paper:

SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip,  Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems (CASES) 2008.

(Or view ‘bonus slides’ after talk.)

Page 28: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

32

More realistic SoC-C code

DVB-T Inner ReceiverOFDM receiver 20 tasks 500-7000 cycles each 29000 cycles total

adc_t adc;

ADC_Init(&adc,ADC_BUFSIZE_SAMPLES,adc_Re,adc_Im,13);

SOCC_PIPELINE {

ChannelEstimateInit_DVB_simd(TPS_INFO, CrRe, CrIm) @ DEd;

for(int sym = 0; sym<LOOPS; ++sym) {

cbuffer_t src_r, src_i;

unsigned len = Nguard+asC_MODE[Mode];

ADC_AcquireData(&adc,(sym*len)%ADC_BUFSIZE_SAMPLES,len,&src_r, &src_i);

align(sym_Re,&src_r,len*sizeof(int16_t)) @ DMA_512;

align(sym_Im,&src_i,len*sizeof(int16_t)) @ DMA_512;

ADC_ReleaseRoom(&adc,&src_r,&src_i,len);

RxGuard_DVB_simd(sym_Re,sym_Im,TPS_INFO,Nguard,guarded_Re,guarded_Im) @ DEa;

cscale_DVB_simd(guarded_Re,guarded_Im,23170,avC_MODE[Mode],fft_Re,fft_Im) @ DEa;

fft_DVB_simd(fft_Re,fft_Im,TPS_INFO,ReFFTTwid,ImFFTTwid) @ DEa;

SymUnWrap_DVB_simd(fft_Re,fft_Im,TPS_INFO,unwrapped_Re,unwrapped_Im) @ DEb;

DeMuxSymbol_DVB_simd(unwrapped_Re,unwrapped_Im,TPS_INFO,ISymNum,

demux_Re,demux_Im,PilotsRe,PilotsIm,TPSRe,TPSIm) @ DEb;

DeMuxSymbol_DVB_simd(CrRe,CrIm,TPS_INFO,ISymNum,

demux_CrRe,demux_CrIm,CrPilotsRe,CrPilotsIm,CrTPSRe,CrTPSIm) @ DEb;

cfir1_DVB_simd(demux_Re,demux_Im,demux_CrRe,demux_CrIm,avN_DCPS[Mode],equalized_Re,equalized_Im) @ DEc;

cfir1_DVB_simd(TPSRe,TPSIm,CrTPSRe,CrTPSIm,avN_TPSSCPS[Mode],equalized_TPSRe,equalized_TPSIm) @ DEb;

DemodTPS_DVB_simd(equalized_TPSRe,equalized_TPSIm,TPS_INFO,Pilot,TPSRe) @ DEb;

DemodPilots_DVB_simd(PilotsRe,PilotsIm,TPS_INFO,ISymNum,demod_PilotsRe,demodPilotsIm) @ DEb;

cmagsq_DVB_simd(demux_CrRe,demux_CrIm,12612,avN_DCPS[Mode],MagCr) @ DEc;

int Direction = (ISymNum & 1);

Direction ^= 1;

if (Direction) {

Error=SymInterleave3_DVB_simd2(equalized_Re,equalized_Im,MagCr,

DE_vinterleave_symbol_addr_DVB_T_N,

DE_vinterleave_symbol_addr_DVB_T_OFFSET,

TPS_INFO,Direction,sRe,sIm,sCrMag) @ DEc;

pack3_DVB_simd(sRe,sIm,sCrMag,avN_DCPS[Mode],interleaved_Re,interleaved_Im,Range) @ DEc;

} else {

unpack3_DVB_simd(equalized_Re,equalized_Im,MagCr,avN_DCPS[Mode],sRe,sIm,sCrMag) @ DEc;

Error=SymInterleave3_DVB_simd2(sRe,sIm,sCrMag,

DE_vinterleave_symbol_addr_DVB_T_N,

DE_vinterleave_symbol_addr_DVB_T_OFFSET,

TPS_INFO,Direction,interleaved_Re,interleaved_Im,Range) @ DEc;

}

ChannelEstimate_DVB_simd(interleaved_Re,interleaved_Im,Range,TPS_INFO,CrRe2,CrIm2) @ DEd;

Demod_DVB_simd(interleaved_Re,interleaved_Im,TPS_INFO,Range,demod_softBits) @ DEd;

BitDeInterleave_DVB_simd(demod_softBits,TPS_INFO,deint_softBits) @ DEd;

uint_t err=HardDecoder_DVB_simd(deint_softBits,uvMaxCnt,hardbits) @ DEd;

Bytecpy(&output[p],hardbits,uMaxCnt/8) @ ARM;

p += uMaxCnt/8;

ISymNum = (ISymNum+1) % 4;

}

ADC_Fini(&adc);

Page 29: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

33 33

Parallel Speedup

Efficient Same performance as hand-written code

Near Linear Speedup Very efficient use of parallel hardware

0%

50%

100%

150%

200%

250%

300%

350%

400%

1 2 3 4

Speedup

Page 30: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

34 34

What SoC-C Provides

SoC-C language features Pipeline to support parallelism Coherence to support distributed memory RPC to support multiple processors/ISAs

Non-features Does not choose boundary between pipeline stages Does not resolve coherence problems Does not allocate processors

SoC-C is concise notation to express mapping decisions (not a tool for making them on your behalf)

Page 31: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

35 35

Related Work

Language OpenMP: SMP data parallelism using ‘C plus annotations’ StreamIt: Pipeline parallelism using dataflow language

Pipeline parallelism J.E. Smith, “Decoupled access/execute computer

architectures,” Trans. Computer Systems, 2(4), 1984 Multiple independent reinventions

Hardware Woh et al., “From Soda to Scotch: The Evolution of a

Wireless Baseband Processor,” Proc. MICRO-41, Nov. 2008

Page 32: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

36 36

More Recent Related Work

Mapping applications onto Embedded SoCs Exposing Non-Standard Architectures to Embedded Software

using Compile-Time Virtualization, CASES 2009

Pipeline parallelism The Paralax Infrastructure: Automatic Parallelization with a

Helping Hand, PACT 2010

Page 33: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

37 37

The SoC-C Model

Program as if using SMP system Single multithreaded processor: RPCs provide a “Migrating

thread Model” Single memory: Compiler Managed Coherence handles

“bookkeeping” Annotations change execution, not semantics

Avoid need to restructure code Pipeline parallelism Compiler managed coherence

Efficiency Avoid abstracting expensive operations programmer can optimize and reason about

Page 34: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

38

Kernel Programming

Page 35: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

39

Overview

Example: FIR filter

Hand-vectorized code Optimal performance

Issues

An Alternative Approach

1

0

T

jjiji xhy

Page 36: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

40

Example Vectorized Code

Very fast, efficient code Uses 32-wide SIMD

Each SIMD multiply performs 32 (useful) multiplies

VLIW compiler overlaps operations 3 vector operations per cycle

VLIW compiler performs software pipelining Multiplier active on every cycle

void FIR(vint16_t x[], vint16_t y[], int16_t h[]) {

vint16_t v = x[0];

for (int i=0; i<N/SIMD_WIDTH; ++i) {

vint16_t w = x[i+1];

vint32L_t acc = vqdmull(v,h[0]);

s = vget_lane(w,0);

v = vdown(v,s);

for(int j=1; j<T-1; ++j) {

acc = vqdmlal(acc,v,h[j]);

s = vget_lane(w,j);

v = vdown(v,s);

}

y[i] = vqrdmlah(acc,v,h[j]);

v = w;

}

}

Page 37: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

41

Portability Issues

Vendor specific SIMD operations

vqdmull, vdown, vget_lane

SIMD-width specific

Assumes SIMD_WIDTH >= T

Doesn’t work/performs badly on Many SIMD architectures

GPGPU

SMP

Page 38: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

42

Flexibility issues

Improve arithmetic intensity Merge with adjacent kernel

E.g., if filtering input to FFT, combine with bit reversal

Parallelize task across two Ardbeg engines Requires modification to system-level code

Page 39: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

43

Summary

Programming directly to the processor Produces very high performance code

Kernel is not portable to other processor types

Kernels cannot be remapped to other devices

Kernels cannot be split/merged to improve scheduling or reduce inter-kernel overheads

Often produces local optimum

But misses global optimum

Page 40: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

44

(Towards)Performance-Portable Kernel Programming

Page 41: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

45

Outline

The goal

Quick and dirty demonstration

References to (more complete) versions

What still needs to be done

Page 42: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

46

An alternative approach

Compiler

1

0

T

jjiji xhy

Page 43: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

47

A simple data parallel languageloop(N) {

V1 = load(a);

V2 = load(b);

V3 = add(V1,V2);

store(c,V3);

}

* Currently implemented as a Haskell EDSL – adapted to C-like notation for presentation.

a0 a1 a2 a3 a4 a5 a6 a7 ...V1:

b0 b1 b2 b3 b4 b5 b6 b7 ...V2:

V3:a0+b0

a1+b1

a2+b2

a3+b3

a4+b4

a5+b5

a6+b6

a7+b7

...

c:a0+b0

a1+b1

a2+b2

a3+b3

a4+b4

a5+b5

a6+b6

a7+b7

...

N

Page 44: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

48

Compiling Vector Expressions

Vector Expression Init Generate Next

V1 = load(a); p1=a; V1=vld1(p1); p1+=32;

V2 = load(b); p2=b; V2=vld1(p2); p2+=32;

V3 = add(V1,V2); V3=vadd(V1,V2);

store(c,V3); p3=c; vst1(p3,V3); p3+=32;

p1=a; p2=b; p3=c;for(i=0; i<N; i+=32) { V1=vld1(p1); V2=vld1(p2); V3=vadd(V1,V2); vst1(p3,V3); p1+=32; p2+=32; p3+=32;}

Page 45: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

50

Generating datapath

+1

+1

MemA

MemB

+

+1

MemC

* Warning: this circuit does not adhere to any ARM quality standards.

Page 46: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

51

Adding control

+1

+1

-1

!=0

MemA

MemB

+

+1

MemC

nDone

* Warning: this circuit does not adhere to any ARM quality standards.

en

en

en

en

Page 47: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

52

Fixing timing

+1

+1

-1

!=0

MemA

MemB

+

+1

MemC

nDone

en

enen

en

Page 48: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

53

Related Work

NESL – Nested Data Parallelism (CMU) Cray Vector machines, Connection Machine

DpH – Generalization of NESL as a Haskell library (SLPJ++) GPGPU

Accelerator – Data parallel library in C#/F# (MSR) SMP, DirectX9, FPGA

Array Building Blocks – C++ template library (Intel) SMP, SSE

Thrust – C++ template library (NVidia) GPGPU

(Also: Parallel Skeletons, Map-reduce, etc. etc.)

Page 49: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

54

Summary of approach

(Only) Use highly structured bulk operations Bulk operations reason about vectors, not individual elements

Simple mathematical properties easy to optimize

Single frontend, multiple backends SIMD, SMP, GPGPU, FPGA, ...

(Scope for significant platform-dependent optimization)

Page 50: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

55

Breaking down boundaries

Hard boundary between system and kernel layers Separate languages

Separate tools

Separate people writing/optimizing

Need to soften boundary Allow kernels to be split across processors

Allow kernels to be merged across processors

Allow kernels A and B to agree to use a non-standard memory layout (to run more efficiently)

(This is an open problem)

Page 51: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

56

Tackling Performance Portability Problem

High Performance Embedded Systems Energy Efficient systems are “lumpy”

The hardware is the easy bit

Two level approach System Programming

Stitch kernels together

Inter-kernel parallelism, mapping onto processors/memory

Kernel Programming C+builtins efficient but inflexible, non-portable

Simple DPL in talk, references to more substantial efforts

Intra-kernel parallelism expressed

Boundary must be softened

Page 52: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

57

Fin

Page 53: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

58 58

Language Design Meta Issues

Compiler only uses simple analyses Easier to maintain consistency between different compiler

versions/implementations

Programmer makes the high-level decisions Code and Data Placement

Inserting SYNC

Load balancing

Implementation by many source-source transforms Programmer can mix high- and low-level features

90-10 rule: use high-level features when you can, low-level features when you need to

Page 54: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

59 59

Compiling SoC-C

1. Data Placementa) Infer data placement

b) Propagate coherence

c) Split variables with multiple placement

2. Pipeline Parallelisma) Identify maximal threads

b) Split into multiple threads

c) Apply zero copy optimization

3. RPC (see paper for details)

Page 55: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

60 60

Step 1a: Infer Data Placement int x[100];

int y[100];

int z[100];

PIPELINE {

while (1) {

get(x);

foo(y,x) @ P0;

SYNC(x) @ DMA;

FIFO(y);

bar(z,y) @ P1;

baz(z) @ P1;

put(z);

}

}

Solve Set of Constraints

Page 56: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

61 61

Step 1a: Infer Data Placement int x[100];

int y[100];

int z[100];

PIPELINE {

while (1) {

get(x);

foo(y,x) @ P0;

SYNC(x) @ DMA;

FIFO(y);

bar(z,y) @ P1;

baz(z) @ P1;

put(z);

}

}

Solve Set of Constraints Memory Topology constrains where

variables could live

Page 57: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

62 62

Solve Set of Constraints Memory Topology constrains where

variables could live

Step 1a: Infer Data Placement int x[100] @ {M0};

int y[100] @ {M0,M1};

int z[100] @ {M1};

PIPELINE {

while (1) {

get(x@?);

foo(y@M0, x@M0) @ P0;

SYNC(y,?,?) @ DMA;

FIFO(y@?);

bar(z@M1, y@M1) @ P1;

baz(z@M1) @ P1;

put(z@?);

}

}

Page 58: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

63 63

Solve Set of Constraints Memory Topology constrains where

variables could live

Forwards Dataflow propagates availability of valid versions

Step 1b: Propagate Coherenceint x[100] @ {M0};

int y[100] @ {M0,M1};

int z[100] @ {M1};

PIPELINE {

while (1) {

get(x@?);

foo(y@M0, x@M0) @ P0;

SYNC(y,?,?) @ DMA;

FIFO(y@?);

bar(z@M1, y@M1) @ P1;

baz(z@M1) @ P1;

put(z@?);

}

}

Page 59: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

64 64

Solve Set of Constraints Memory Topology constrains where

variables could live

Forwards Dataflow propagates availability of valid versions

Step 1b: Propagate Coherenceint x[100] @ {M0};

int y[100] @ {M0,M1};

int z[100] @ {M1};

PIPELINE {

while (1) {

get(x@?);

foo(y@M0, x@M0) @ P0;

SYNC(y,?,M0) @ DMA;

FIFO(y@?);

bar(z@M1, y@M1) @ P1;

baz(z@M1) @ P1;

put(z@M1);

}

}

Page 60: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

65 65

Solve Set of Constraints Memory Topology constrains where

variables could live

Forwards Dataflow propagates availability of valid versions

Backwards Dataflow propagates need for valid versions

Step 1b: Propagate Coherenceint x[100] @ {M0};

int y[100] @ {M0,M1};

int z[100] @ {M1};

PIPELINE {

while (1) {

get(x@?);

foo(y@M0, x@M0) @ P0;

SYNC(y,?,M0) @ DMA;

FIFO(y@?);

bar(z@M1, y@M1) @ P1;

baz(z@M1) @ P1;

put(z@M1);

}

}

Page 61: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

66 66

Solve Set of Constraints Memory Topology constrains where

variables could live

Forwards Dataflow propagates availability of valid versions

Backwards Dataflow propagates need for valid versions

Step 1b: Propagate Coherenceint x[100] @ {M0};

int y[100] @ {M0,M1};

int z[100] @ {M1};

PIPELINE {

while (1) {

get(x@M0);

foo(y@M0, x@M0) @ P0;

SYNC(y,M1,M0) @ DMA;

FIFO(y@M1);

bar(z@M1, y@M1) @ P1;

baz(z@M1) @ P1;

put(z@M1);

}

}

Page 62: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

67 67

Step 1c: Split Variablesint x[100] @ {M0}; int y0[100] @ {M0};int y1[100] @ {M1};int z[100] @ {M1};PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); }}

Split variables with multiple locations

Replace SYNC with memcpy

Page 63: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

68 68

Step 2: Implement Pipeline Annotationint x[100] @ {M0}; int y0[100] @ {M0};int y1[100] @ {M1};int z[100] @ {M1};PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); }}

Dependency Analysis

Page 64: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

69 69

Step 2a: Identify Dependent Operationsint x[100] @ {M0}; int y0[100] @ {M0};int y1[100] @ {M1};int z[100] @ {M1};PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); }}

Dependency Analysis

Split use-def chains at FIFOs

Page 65: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

70 70

Step 2b: Identify Maximal Threadsint x[100] @ {M0}; int y0[100] @ {M0};int y1[100] @ {M1};int z[100] @ {M1};PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); }}

Dependency Analysis

Split use-def chains at FIFOs

Identify Thread Operations

Page 66: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

71 71

Step 2b: Split Into Multiple Threadsint x[100] @ {M0}; int y0[100] @ {M0};int y1a[100] @ {M1};int y1b[100] @ {M1}; int z[100] @ {M1};PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1a,y0,…) @ DMA; fifo_put(&f, y1a); } } SECTION { while (1) { fifo_get(&f, y1b); bar(z, y1b) @ P1; baz(z) @ P1; put(z); } }}

Perform Dataflow Analysis

Split use-def chains at FIFOs

Identify Thread Operations

Split into threads

Page 67: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

72 72

Step 2c: Zero Copy Optimizationint x[100] @ {M0}; int y0[100] @ {M0};int y1a[100] @ {M1};int y1b[100] @ {M1};int z[100] @ {M1};PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1a,y0,…) @ DMA; fifo_put(&f, y1a); } } SECTION { while (1) { fifo_get(&f, y1b); bar(z, y1b) @ P1; baz(z) @ P1; put(z); } }}

Generate DataCopy into FIFO

Copy out of FIFOConsume Data

Page 68: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

73 73

Step 2c: Zero Copy Optimizationint x[100] @ {M0}; int y0[100] @ {M0};int y1a[100] @ {M1};int y1b[100] @ {M1};int z[100] @ {M1};PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1a,y0,…) @ DMA; fifo_put(&f, y1a); } } SECTION { while (1) { fifo_get(&f, y1b); bar(z, y1b) @ P1; baz(z) @ P1; put(z); } }}

Calculate Live Range of variables passed through FIFOs

Live Range of y1a

Live Range of y1b

Page 69: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

74 74

Step 2c: Zero Copy Optimizationint x[100] @ {M0}; int y0[100] @ {M0};int *py1a;int *py1b;int z[100] @ {M1};PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; fifo_acquireRoom(&f, &py1a); memcpy(py1a,y0,…) @ DMA; fifo_releaseData(&f, py1a); } } SECTION { while (1) { fifo_acquireData(&f, &py1b); bar(z, py1b) @ P1; fifo_releaseRoom(&f, py1b); baz(z) @ P1; put(z); } }}

Calculate Live Range of variables passed through FIFOs

Transform FIFO operations to pass pointers instead of copying data

Acquire empty buffer

Generate data directly into buffer

Pass full buffer to thread 2

Acquire full buffer from thread 1

Consume data directly from buffer

Release empty buffer

Page 70: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

75 75

Step 3a: Resolve Overloaded RPCint x[100] @ {M0}; int y0[100] @ {M0};int *py1a;int *py1b;int z[100] @ {M1};PARALLEL { SECTION { while (1) { get(x); DE32_foo(0, y0, x); fifo_acquireRoom(&f, &py1a); DMA_memcpy(py1a,y0,…); fifo_releaseData(&f, py1a); } } SECTION { while (1) { fifo_acquireData(&f, &py1b); DE32_bar(1, z, py1b); fifo_releaseRoom(&f, py1b); DE32_baz(1, z); put(z); } }}

Replace RPC by architecture specific call

bar(…)@P1 DE32_bar(1,…)

Page 71: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

76 76

Step 3b: Split RPCsint x[100] @ {M0}; int y0[100] @ {M0};int *py1a;int *py1b;int z[100] @ {M1};

PARALLEL { SECTION { while (1) { get(x); start_DE32_foo(0, y0, x); wait(semaphore_DE32[0]); fifo_acquireRoom(&f, &py1a); start_DMA_memcpy(py1a,y0,…); wait(semaphore_DMA); fifo_releaseData(&f, py1a); } } SECTION { while (1) { fifo_acquireData(&f, &py1b); start_DE32_bar(1, z, py1b); wait(semaphore_DE32[1]); fifo_releaseRoom(&f, py1b); start_DE32_baz(1, z); wait(semaphore_DE32[1]); put(z); } }}

RPCs have two phases

start RPC

wait for RPC to complete

DE32_foo(0,…);

start_DE32_foo(0,…);

wait(semaphore_DE32[0]);

Page 72: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

77 77

Order of transformations

Dataflow-sensitive transformations go first Inferring data placement

Coherence checking within threads

Dependency analysis for parallelism

Parallelism transformations Obscures data and control flow

Thread-local optimizations go last Zero-copy optimization of FIFO operations

Continuation passing thread implementation

Page 73: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

78

Aside: Why hardware companies are fun

You get to play with cool hardware Often before it has been debugged

You get to play with powerful debugging tools Incredible level of detail visible

E.g., Palladium traces on next slides

Page 74: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

79 79

Unoptimized task scheduling

968

DE0

DE1

fft demod

195 cycles

273 cycles

Page 75: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

80 80

Optimized device driver on ARM

968

DE0

DE1

fft demod

155 cycles

257 cycles

Page 76: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem

81 81

Task scheduling hardware support

968

DE0

DE1

fft demod

1 cycle

303 cycles

183 cycles