1 University of Michigan, Electrical Engineering and Computer Science
Orchestrating the Execution of Stream Programs on Multicore Platforms
Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab.
University of Michigan
2 University of Michigan, Electrical Engineering and Computer Science
Cores are the New Gates
[Chart: cores per chip (1 to 512, log scale) vs. year (1975 to 2010). Processors plotted include 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, Core, CoreDuo, Core2Duo, Core2Quad, Xeon, Power6, Xbox 360, BCM 1480, Opteron 4P, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, and NVIDIA G80, grouped into unicore, homogeneous multicore, and heterogeneous multicore eras. (Shekhar Borkar, Intel; courtesy Gordon'06)]
[Figure: programming models converging on stream programming: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind. (Courtesy Gordon'06)]
3 University of Michigan, Electrical Engineering and Computer Science
Stream Programming
• Programming style
  – Embedded domain: audio/video (H.264), wireless (WCDMA)
  – Mainstream: continuous query processing (IBM SystemS), search (Google Sawzall)
• Stream
  – Collection of data records
• Kernels/Filters
  – Functions applied to streams
  – Input/output are streams
  – Coarse-grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS'02, '06, PLDI '03]
4 University of Michigan, Electrical Engineering and Computer Science
Compiling Stream Programs
[Figure: a stream program graph mapped onto a multicore system of Core 1 through Core 4, each with its own memory.]
• Heavy lifting
• Equal work distribution
• Communication
• Synchronization
5 University of Michigan, Electrical Engineering and Computer Science
Stream Graph Modulo Scheduling (SGMS)
• Coarse-grain software pipelining
  – Equal work distribution
  – Communication/computation overlap
• Target: Cell processor
  – Cores with disjoint address spaces
  – Explicit copy to access remote data
    • DMA engine independent of PEs
• Filters = operations, cores = function units
[Figure: Cell block diagram. SPE0 through SPE7, each an SPU with a 256 KB local store and an MFC (DMA), connected over the EIB to the PPE (PowerPC) and DRAM.]
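Since SPEs have disjoint address spaces, a filter cannot dereference remote data; its inputs must be copied into the local store before it runs. Below is a minimal C sketch of that pattern. The helpers dma_get/dma_put/dma_wait are hypothetical stand-ins for the Cell SDK's mfc_get/mfc_put-style intrinsics, and CHUNK is an illustrative buffer size.

#include <stddef.h>

/* Hypothetical DMA helpers standing in for the MFC intrinsics. */
extern void dma_get(void *dst, unsigned long remote_src, size_t n);
extern void dma_put(const void *src, unsigned long remote_dst, size_t n);
extern void dma_wait(void);

#define CHUNK 1024

static int in_buf[CHUNK];    /* buffers live in the 256 KB local store */
static int out_buf[CHUNK];

/* One filter firing: explicit copy-in, compute, explicit copy-out. */
void run_filter_once(unsigned long remote_in, unsigned long remote_out)
{
    dma_get(in_buf, remote_in, sizeof in_buf);
    dma_wait();                                    /* data is now local */
    for (int i = 0; i < CHUNK; i++)
        out_buf[i] = 2 * in_buf[i];                /* the filter's work */
    dma_put(out_buf, remote_out, sizeof out_buf);
    dma_wait();
}

Because the MFC runs independently of the SPU, the two dma_wait() calls can be pushed apart so the next transfer overlaps with computation, which is exactly what SGMS's stage assignment arranges.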
6 University of Michigan, Electrical Engineering and Computer Science
Streamroller Software Overview
[Figure: the Streamroller toolchain. Front ends (StreamIt; SPEX, C++ with stream extensions; Stylized C) feed a Stream IR, which passes through Profiling, Analysis, Scheduling, and Code Generation. Targets: a custom application engine, SODA (a low-power multicore processor for SDR), and general multicores (Cell, Core2Quad, Niagara). Focus of this talk: the scheduling path to multicore targets.]
7 University of Michigan, Electrical Engineering and Computer Science
Preliminaries
• Synchronous Data Flow (SDF) [Lee '87]
• StreamIt [Thies '02]
• Filters push and pop items from input/output FIFOs:

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

• As written, FIR is stateless. A variant that carries mutable state across firings, e.g.

int wgts[N];
wgts = adapt(wgts);

is stateful.
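To make the pop/peek/push semantics concrete, here is a small self-contained C sketch of a stateless FIR-style filter over a flat input array; the layout and names (fir_step, the arrays in main) are illustrative, not StreamIt's actual runtime.

#include <stdio.h>

#define TAPS 4

/* One firing: peek TAPS items, push one sum, pop one item. */
static int fir_step(const int *in, const int wgts[TAPS])
{
    int sum = 0;
    for (int i = 0; i < TAPS; i++)
        sum += in[i] * wgts[i];    /* peek(i) * wgts[i] */
    return sum;                    /* push(sum); caller advances by 1 (pop) */
}

int main(void)
{
    int input[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const int wgts[TAPS] = {1, 1, 1, 1};

    /* pop 1 / push 1 per firing, but each firing peeks TAPS items,
       so the filter can fire only while TAPS items remain */
    for (int head = 0; head + TAPS <= 8; head++)
        printf("%d ", fir_step(&input[head], wgts));  /* 10 14 18 22 26 */
    printf("\n");
    return 0;
}

Because the work function writes no state that survives across firings, this filter is stateless and can be fissed; the adapt(wgts) variant above carries wgts across firings and cannot.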
8 University of Michigan, Electrical Engineering and Computer Science
SGMS Overview
[Figure: a stream graph is partitioned across PE0 through PE3 so per-PE work is balanced, DMA operations are inserted on cross-PE edges, and execution is software-pipelined with a prologue to fill the pipeline and an epilogue to drain it. The annotation T1/T4 ≈ 4 indicates near-ideal speedup on four PEs.]
9 University of Michigan, Electrical Engineering and Computer Science
SGMS Phases
• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation
10 University of Michigan, Electrical Engineering and Computer Science
Processor Assignment
• Assign filters to processors
  – Goal: equal work distribution
• Graph partitioning? Bin packing? (a greedy baseline is sketched below)
[Figure: original stream program, with A (work 5) feeding B (40) and C (10), which merge into D (5); total work = 60. Best 2-PE assignment: B on PE0; A, C, D on PE1. Speedup = 60/40 = 1.5. Modified stream program: B fissed into B1 and B2 with split S and join J. Assignment: B2, C, J on PE0; A, S, B1, D on PE1. Speedup = 60/32 ≈ 2.]
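The slide's "bin packing?" question can be made concrete with a greedy longest-processing-time heuristic: sort filters by work, then place each on the least-loaded PE. A minimal C sketch using the A/B/C/D weights above (5, 40, 10, 5); note that it stalls at the 60/40 split until B is fissed, which is why fission is integrated into the assignment on the next slides.

#include <stdio.h>
#include <stdlib.h>

#define NFILTERS 4
#define NPES     2

static int cmp_desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;   /* heaviest first */
}

int main(void)
{
    int work[NFILTERS] = {5, 40, 10, 5};        /* A, B, C, D */
    int load[NPES] = {0};

    qsort(work, NFILTERS, sizeof work[0], cmp_desc);
    for (int f = 0; f < NFILTERS; f++) {
        int best = 0;                           /* least-loaded PE so far */
        for (int p = 1; p < NPES; p++)
            if (load[p] < load[best]) best = p;
        load[best] += work[f];
    }
    for (int p = 0; p < NPES; p++)
        printf("PE%d load = %d\n", p, load[p]); /* 40 and 20: II stuck at 40 */
    return 0;
}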
11 University of Michigan, Electrical Engineering and Computer Science
Filter Fission Choices
[Figure: alternative ways to fiss a filter and distribute the pieces across PE0 through PE3. Speedup ≈ 4?]
12 University of Michigan, Electrical Engineering and Computer Science
Integrated Fission + PE Assignment
• Exact solution based on Integer Linear Programming (ILP)
• Split/join overhead factored in
• Objective function
  – Minimize the maximal load on any PE
• Result
  – Number of times to "split" each filter
  – Filter → processor mapping
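A sketch of the load-balance core of such a formulation, in assumed notation ($w_f$ = work of filter $f$, $a_{f,p}$ = 0/1 variable assigning $f$ to PE $p$, $T$ = maximal PE load); the actual formulation additionally encodes the fission degree of each filter and the split/join overhead (see the constraint slide in the backup):

\begin{align*}
\min\; T \quad \text{subject to} \quad
  \sum_{f} w_f\, a_{f,p} &\le T && \forall\, p,\\
  \sum_{p} a_{f,p} &= 1 && \forall\, f,\\
  a_{f,p} &\in \{0, 1\}.
\end{align*}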
13 University of Michigan, Electrical Engineering and Computer Science
SGMS Phases
• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation
14 University of Michigan, Electrical Engineering and Computer Science
Forming the Software Pipeline
• To achieve speedup
  – All chunks should execute concurrently
  – Communication should be overlapped
• Processor assignment alone is insufficient information
[Figure: filters A->B->C, with A on PE0 and B, C on PE1. Timelines: running Ai and Bi back-to-back leaves the A->B transfer on the critical path (marked X); giving the A->B DMA its own slot and overlapping A_{i+2} with B_i hides the communication.]
15 University of Michigan, Electrical Engineering and Computer Science
Stage Assignment
• Preserve causality (producer-consumer dependence): for an edge from filter i to filter j within one PE, assign S_j ≥ S_i.
• Communication-computation overlap: for an edge from filter i on PE 1 to filter j on PE 2, the DMA gets its own stage with S_DMA > S_i, and the consumer gets S_j = S_DMA + 1.
• Data-flow traversal of the stream graph
  – Assign stages using the above two rules (a sketch in code follows)
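A minimal C sketch of that traversal, assuming filters are numbered in topological order and edges are visited in that order; the arrays and names are illustrative.

#include <stdio.h>

struct edge { int src, dst; };

/* Apply the two stage-assignment rules along each edge. */
static void assign_stages(const struct edge *e, int ne,
                          const int *pe, int *stage /* zero-initialized */)
{
    for (int k = 0; k < ne; k++) {
        int i = e[k].src, j = e[k].dst;
        if (pe[i] == pe[j]) {
            if (stage[j] < stage[i])
                stage[j] = stage[i];        /* causality: S_j >= S_i */
        } else {
            int s_dma = stage[i] + 1;       /* DMA stage: S_DMA > S_i */
            if (stage[j] < s_dma + 1)
                stage[j] = s_dma + 1;       /* overlap: S_j = S_DMA + 1 */
        }
    }
}

int main(void)
{
    /* A(0) -> B(1) -> C(2), with A on PE0 and B, C on PE1 */
    struct edge e[] = { {0, 1}, {1, 2} };
    int pe[3] = {0, 1, 1};
    int stage[3] = {0};

    assign_stages(e, 2, pe, stage);
    for (int n = 0; n < 3; n++)
        printf("filter %d -> stage %d\n", n, stage[n]);  /* 0, 2, 2 */
    return 0;
}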
16 University of Michigan, Electrical Engineering and Computer Science
Stage Assignment Example
[Figure: the fissed example graph (A, S, B1, B2, C, J, D) mapped to PE0 and PE1. Stage 0: A, S, B1. Stage 1: three DMAs (A to C, S to B2, B1 to J). Stage 2: B2, C, J. Stage 3: DMA (C to D, J to D). Stage 4: D.]
17 University of Michigan, Electrical Engineering and Computer Science
SGMS Phases
• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation
18 University of Michigan, Electrical Engineering and Computer Science
Code Generation for Cell
• Target the Synergistic Processing Elements (SPEs)
  – PS3: up to 6 SPEs
  – QS20: up to 16 SPEs
• One thread per SPE
• Challenge
  – Making a collection of independent threads implement a software pipeline
  – Adapt the kernel-only code schema of a modulo schedule
19 University of Michigan, Electrical Engineering and Computer Science
Complete Example

void spe1_work()
{
  char stage[5] = {0};
  int i;
  stage[0] = 1;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }
    if (stage[1]) { }
    if (stage[2]) { JtoD(); CtoD(); }
    if (stage[3]) { }
    if (stage[4]) { D(); }
    barrier();
  }
}

(The stage-mask shifting that progressively activates stages 1 through 4 is elided here; the full schema appears in the SPE code template in the backup slides.)

[Figure: time-lapse of SPE1, DMA1, SPE2, and DMA2. Each iteration SPE1 runs A, S, B1; DMA1 moves AtoC, StoB2, B1toJ; SPE2 runs B2, J, C; DMA2 moves CtoD and JtoD; SPE1 then runs D. Successive iterations overlap until the pipeline reaches steady state.]
20 University of Michigan, Electrical Engineering and Computer Science
Experiments
• StreamIt benchmarks
  – Signal processing: DCT, FFT, FMRadio, radar
  – MPEG2 decoder subset
  – Encryption: DES
  – Parallel sort: bitonic
  – Range of 30 to 90 filters per benchmark
• Platform
  – QS20 blade server: 2 Cell chips, 16 SPEs
• Software
  – StreamIt to C, gcc 4.1
  – IBM Cell SDK 2.1
21 University of Michigan, Electrical Engineering and Computer Science
Evaluation on QS20
[Chart: relative speedup (0 to 18) on 2P, 4P, 8P, and 16P for bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar, and the geometric mean.]
• Split/join overhead reduces the benefit from fission
• Barrier synchronization (1 per iteration of the stream graph)
22 University of Michigan, Electrical Engineering and Computer Science
SGMS (ILP) vs. Greedy
[Chart: relative speedup (0 to 9) per benchmark for ILP partitioning, greedy partitioning (MIT method, ASPLOS'06), and exposed DMA.]
• Solver time < 30 seconds for 16 processors
23 University of Michigan, Electrical Engineering and Computer Science
Conclusions
• Streamroller
  – Efficient mapping of stream programs to multicore
  – Coarse-grain software pipelining
• Performance summary
  – 14.7x speedup on 16 cores
  – Up to 35% better than the greedy solution (11% on average)
• Scheduling framework
  – Trade off memory space vs. load balance
    • Memory-constrained (embedded) systems
    • Cache-based systems
24 University of Michigan, Electrical Engineering and Computer Science
25 University of Michigan, Electrical Engineering and Computer Science
26 University of Michigan, Electrical Engineering and Computer Science
Naïve Unfolding
[Figure: a six-filter stream graph (A; B, C; D, E; F) replicated onto PE1 through PE4. Top: a completely stateless stream program, where the four copies are linked only by stream data dependences and run independently. Bottom: the same program with stateful nodes, where state data dependences chain corresponding filters across PEs and DMA transfers are required between the copies.]
27 University of Michigan, Electrical Engineering and Computer Science
SGMS (ILP) vs. Greedy
[Chart: relative speedup (0 to 9) per benchmark for ILP partitioning vs. greedy partitioning (MIT method, ASPLOS'06).]
• Solver time < 30 seconds for 16 processors
• Results highly dependent on graph structure
• DMA overlapped in both ILP and Greedy
• Speedup drops with exposed DMA
28 University of Michigan, Electrical Engineering and Computer Science
Unfolding vs. SGMS
[Chart: relative speedup (0 to 18) per benchmark for naïve unfolding vs. SGMS.]
29 University of Michigan, Electrical Engineering and Computer Science
Pros and Cons of Unfolding
• Pros
  – Simple
  – Good speedups for mostly stateless programs
• Cons
  – All input data must be available before an iteration can begin
  – Long latency for one iteration
  – Not suitable for real-time scheduling
  – Speedup limited by filters with large state
[Figure: unfolded copies of the graph A, B, C, D on PE1 through PE3, separated by Sync + DMA steps between iterations.]
30 University of Michigan, Electrical Engineering and Computer Science
SGMS Assumptions
• Data-independent filter work functions
  – Static schedule
• High-bandwidth interconnect
  – EIB: 96 bytes/cycle
  – Low-overhead DMA, low observed latency
• Cores can run MIMD efficiently
  – Not (directly) suitable for GPUs
31 University of Michigan, Electrical Engineering and Computer Science
Speedup Upper Bounds
[Chart: relative speedup upper bound (0 to 50) per benchmark for 2P, 4P, 8P, 16P, 32P, and 64P.]
• Example: with 15% of the work in one stateless filter, max speedup = 100/15 = 6.7
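The annotation is an instance of a simple bound: if one filter that is not (or cannot be) fissed holds a fraction $f$ of the total work, the steady-state throughput of any pipeline is capped by that filter, so

\[ \text{speedup} \le \frac{1}{f}, \qquad f = 0.15 \;\Rightarrow\; \text{speedup} \le \frac{100}{15} \approx 6.7, \]

regardless of how many PEs are added.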
32 University of Michigan, Electrical Engineering and Computer Science
Summary
• 14.7x speedup on 16 cores
  – Preprocessing steps (fission) necessary
  – Hiding communication important
• Extensions
  – Scaling to more cores
  – Constrained memory systems
• Other targets
  – SODA, FPGA, GPU
33 University of Michigan, Electrical Engineering and Computer Science
Stream Programming
• Algorithmic style suitable for
  – Audio/video processing
  – Signal processing
  – Networking (packet processing, encryption/decryption)
  – Wireless domain, software-defined radio
• Characteristics
  – Independent filters
    • Segregated address space
    • Independent threads of control
  – Explicit communication
  – Coarse-grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS'02, '06, PLDI '03, '08]
[Figure: MPEG decoder stream graph with filters bitstream parser, zigzag unorder, inverse quant AC, inverse quant DC, saturate, 1D-DCT (row), bounded saturate, 1D-DCT (column), and motion decode.]
34 University of Michigan, Electrical Engineering and Computer Science
Effect of Exposed DMA
[Chart: relative speedup (0 to 18) per benchmark for SGMS vs. SGMS with exposed DMA.]
35 University of Michigan, Electrical Engineering and Computer Science
Scheduling via Unfolding: Evaluation
• IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 blade server
• StreamIt benchmark characteristics:

Benchmark    Total filters  Stateful  Peeking  State size (bytes)
bitonic      40             1         0        4
channel      55             1         34       252
dct          40             1         0        4
des          55             0         0        4
fft          17             1         0        4
filterbank   85             1         32       508
fmradio      43             1         14       508
tde          55             1         0        4
mpeg2        39             1         0        4
vocoder      54             11        6        112
radar        57             44        0        1032
36 University of Michigan, Electrical Engineering and Computer Science
Absolute Performance

Benchmark    GFLOPS with 16 SPEs
bitonic      N/A
channel      18.73
dct          17.24
des          N/A
fft          1.66
filterbank   18.41
fmradio      17.05
tde          6.49
mpeg2        N/A
vocoder      5.49
radar        30.30
37 University of Michigan, Electrical Engineering and Computer Science
ILP fission constraints (backup). For each actor, exactly one fission variant is selected: b_{1,0} keeps the original actor (node 0_0), b_{1,1} fisses it 2x (nodes 1_0 to 1_3, including split/join), and b_{1,2} fisses it 3x (nodes 2_0 to 2_4). Here a_{1,k,j,i} is the 0/1 assignment of node j of variant k to PE i, P is the number of PEs, and M is a big-M constant:

Original actor:
\[ \sum_{i=1}^{P} a_{1,0,0,i} - b_{1,0} = 0 \]

Fissed 2x:
\[ \sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} \le M b_{1,1} \]
\[ \sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} - 2 \ge -M + M b_{1,1} \]

Fissed 3x:
\[ \sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} \le M b_{1,2} \]
\[ \sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} - 3 \ge -M + M b_{1,2} \]

Exactly one variant is chosen:
\[ b_{1,0} + b_{1,1} + b_{1,2} = 1 \]
38 University of Michigan, Electrical Engineering and Computer Science
SPE Code Template

void spe_work()
{
  char stage[N] = {0};            /* bit mask controls what gets executed */
  int i, j;
  stage[0] = 1;                   /* activate stage 0 */
  /* go through all input items */
  for (i = 0; i < max_iter + N - 1; i++) {
    if (stage[N-1]) { ... }
    if (stage[N-2]) { ... }
    ...
    if (stage[0]) {
      Start_DMA();                /* start DMA operation */
      FilterA_work();             /* call filter work function */
    }
    if (i == max_iter - 1)
      stage[0] = 0;               /* last input item seen: stop stage 0 */
    for (j = N-1; j >= 1; j--)
      stage[j] = stage[j-1];      /* left-shift bit mask (activate more stages) */
    wait_for_dma();               /* poll for completion of all outstanding DMAs */
    barrier();                    /* barrier synchronization */
  }
}
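After the first N-1 iterations the stage mask is all ones and every stage fires in every iteration: the steady state of the modulo schedule. Once i reaches max_iter-1, stage[0] is cleared and zeros shift in, so the last N-1 iterations drain the pipeline, mirroring the prologue/epilogue structure shown in the SGMS overview.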