
Orchestrating the Execution of Stream Programs on Multicore Platforms


Page 1: Orchestrating the Execution of Stream Programs on Multicore Platforms

Orchestrating the Execution of Stream Programs on Multicore Platforms

Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab.

University of Michigan

Page 2: Orchestrating the Execution of Stream Programs on Multicore Platforms

Cores are the New Gates

[Chart: # cores/chip vs. year, 1975–2010 (Shekhar Borkar, Intel; courtesy Gordon ’06). Unicore parts (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, PA8800) give way to homogeneous multicores (Power4, Power6, Opteron, Core, CoreDuo, Core2Duo, Core2Quad, Xeon, Xbox 360, BCM 1480, Niagara, Cell, RAW, RAZA XLR, Cavium) and then heterogeneous multicores (CISCO CSR1, NVIDIA G80, Larrabee, PicoChip, AMBRIC, AMD Fusion), approaching 512 cores/chip.]

Stream programming languages and frameworks target these platforms: CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind (alongside C/C++/Java).

Page 3: Orchestrating the Execution of Stream Programs on Multicore Platforms

Stream Programming

• Programming style
  – Embedded domain: audio/video (H.264), wireless (WCDMA)
  – Mainstream: continuous query processing (IBM SystemS), search (Google Sawzall)
• Stream
  – Collection of data records
• Kernels/Filters
  – Functions applied to streams
  – Input/output are streams
  – Coarse grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS ’02, ’06, PLDI ’03]

Page 4: Orchestrating the Execution of Stream Programs on Multicore Platforms

Compiling Stream Programs

[Figure: a stream program graph mapped onto a multicore system — Core 1–4, each with its own memory]

• Heavy lifting
• Equal work distribution
• Communication
• Synchronization

Page 5: Orchestrating the Execution of Stream Programs on Multicore Platforms

Stream Graph Modulo Scheduling (SGMS)

• Coarse grain software pipelining
  – Equal work distribution
  – Communication/computation overlap
• Target: Cell processor
  – Cores with disjoint address spaces
  – Explicit copy to access remote data
  – DMA engine independent of PEs
• Filters = operations, cores = function units

[Figure: Cell architecture — PPE (PowerPC) and DRAM on the EIB, plus SPE0–SPE7, each with an SPU, 256 KB local store, and MFC (DMA)]

Page 6: Orchestrating the Execution of Stream Programs on Multicore Platforms

Streamroller Software Overview

[Figure: Streamroller toolflow — front ends (StreamIt, SPEX (C++ with stream extensions), Stylized C) feed a Stream IR; profiling, analysis, scheduling, and code generation target a custom application engine, SODA (low-power multicore processor for SDR), or a general multicore (Cell, Core2Quad, Niagara). The multicore path is the focus of this talk.]

Page 7: Orchestrating the Execution of Stream Programs on Multicore Platforms

Preliminaries

• Synchronous Data Flow (SDF) [Lee ’87]
• StreamIt [Thies ’02]

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

• Filters push and pop items from input/output FIFOs
• The FIR filter above is stateless; declaring int wgts[N]; as filter state and updating it on every invocation (wgts = adapt(wgts);) makes the filter stateful (see the C sketch below)
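Roughly in C, the stateless/stateful distinction looks like the sketch below. This is a minimal illustration only, not StreamIt or the paper's runtime; the flat FIFO layout and the adapt-style weight update are assumptions made for the example.

#include <stddef.h>

#define N 8

/* Stateless work function: the output depends only on the current window
   of the input FIFO, so invocations can be replicated (fissed) freely. */
int fir_work(const int *fifo, size_t head, const int wgts[N])
{
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += fifo[head + i] * wgts[i];   /* peek(i) * wgts[i] */
    return sum;                            /* push(sum); caller then pops one item */
}

/* Stateful variant: the weights persist and are rewritten on every
   invocation (a stand-in for "wgts = adapt(wgts)"), so iterations can
   no longer be replicated independently. */
int fir_adaptive_work(const int *fifo, size_t head, int wgts[N])
{
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += fifo[head + i] * wgts[i];
    for (int i = 0; i < N; i++)
        wgts[i] += (fifo[head + i] - sum / N) / 16;  /* illustrative update */
    return sum;
}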

Page 8: Orchestrating the Execution of Stream Programs on Multicore Platforms

SGMS Overview

[Figure: a stream graph partitioned across PE0–PE3 and software-pipelined; per-PE work is equalized (T1 ≈ T4), DMA transfers overlap with computation, and the schedule has prologue and epilogue phases]

Page 9: Orchestrating the Execution of Stream Programs on Multicore Platforms

SGMS Phases

• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation

Page 10: Orchestrating the Execution of Stream Programs on Multicore Platforms

Processor Assignment

• Assign filters to processors
  – Goal: equal work distribution
• Graph partitioning?
• Bin packing? (a minimal greedy sketch follows below)

[Example: original stream program A (5) → {B (40), C (10)} → D (5). Assigning {B | A, C, D} to two PEs gives speedup = 60/40 = 1.5. Fissing B into B1 and B2 (adding splitter S and joiner J) and assigning {B1, A, S, D | B2, C, J} gives speedup = 60/32 ≈ 2.]
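For reference, a greedy (longest-processing-time) bin-packing pass over the example's work estimates looks like the sketch below. It only illustrates the "bin packing?" bullet and recovers the 60/40 = 1.5 assignment; it is not the paper's method, which is the integrated ILP on the next slide.

#include <stdio.h>

#define NUM_FILTERS 4
#define NUM_PES     2

int main(void)
{
    /* Work estimates from the example, sorted by decreasing work. */
    const char *name[NUM_FILTERS] = { "B", "C", "A", "D" };
    const int   work[NUM_FILTERS] = {  40,  10,   5,   5 };
    int load[NUM_PES] = { 0 };

    for (int f = 0; f < NUM_FILTERS; f++) {
        int best = 0;                       /* pick the least-loaded PE */
        for (int p = 1; p < NUM_PES; p++)
            if (load[p] < load[best])
                best = p;
        load[best] += work[f];
        printf("%s -> PE%d\n", name[f], best);
    }
    /* Max load is 40 (B alone on one PE): speedup = 60/40 = 1.5.
       Only fissing B can improve on this, which is what SGMS does. */
    for (int p = 0; p < NUM_PES; p++)
        printf("PE%d load = %d\n", p, load[p]);
    return 0;
}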

Page 11: Orchestrating the Execution of Stream Programs on Multicore Platforms

Filter Fission Choices

[Figure: alternative ways to fiss filters across PE0–PE3 — can we reach speedup ≈ 4?]

Page 12: Orchestrating the Execution of Stream Programs on Multicore Platforms

Integrated Fission + PE Assign

• Exact solution based on Integer Linear Programming (ILP)
• Objective function
  – Minimize the maximal load on any PE
  – Split/join overhead factored in
• Result
  – Number of times to “split” each filter
  – Filter → processor mapping
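A schematic of that objective (notation assumed here, not the paper's exact linearization): a_{f,p} marks a copy of filter f on PE p, k_f is the number of copies of f, w_f its work estimate, and s_f its split/join overhead.

\begin{align*}
\text{minimize }\; & T \\
\text{subject to }\; & \sum_{f} a_{f,p}\left(\frac{w_f}{k_f} + s_f\right) \;\le\; T && \forall\, p = 1,\dots,P \\
& k_f = \sum_{p=1}^{P} a_{f,p} \;\ge\; 1, \qquad a_{f,p} \in \{0,1\} && \forall\, f
\end{align*}

The actual ILP avoids the nonlinear w_f / k_f term by enumerating a small set of fission configurations per filter and selecting exactly one with 0/1 variables (see the big-M constraints in the backup slides).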

Page 13: Orchestrating the Execution of Stream Programs on Multicore Platforms

SGMS Phases

• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation

Page 14: Orchestrating the Execution of Stream Programs on Multicore Platforms

Forming the Software Pipeline

• To achieve speedup
  – All chunks should execute concurrently
  – Communication should be overlapped
• Processor assignment alone is insufficient information

[Figure: filters A and B assigned to PE0 and PE1; running Ai and Bi back to back leaves the A→B transfer exposed, while pipelining the schedule overlaps Ai+2 with Bi so the transfer is hidden]

Page 15: Orchestrating the Execution of Stream Programs on Multicore Platforms

Stage Assignment

• Producer i and consumer j on the same PE: Sj ≥ Si
  – Preserves causality (producer-consumer dependence)
• Producer i on PE 1, consumer j on PE 2, DMA in between: SDMA > Si and Sj = SDMA + 1
  – Communication-computation overlap
• Data flow traversal of the stream graph
  – Assign stages using the above two rules (see the sketch below)
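A minimal sketch of that traversal in C, using the fissed example from the next slide. The graph encoding, PE-assignment arrays, and names are illustrative assumptions, not the paper's data structures.

#include <stdio.h>

#define NFILT 7
#define NEDGE 8

struct edge { int src, dst; };

int main(void)
{
    /* Filters A, S, B1, B2, C, J, D in topological order, with the PE
       assignment used on the next slide (PE0 = {A,S,B1,D}, PE1 = {B2,C,J}). */
    const char *name[NFILT] = { "A", "S", "B1", "B2", "C", "J", "D" };
    const int   pe[NFILT]   = {  0,   0,    0,    1,   1,   1,   0  };
    const struct edge e[NEDGE] = {      /* edges listed in topological order */
        {0,1}, {1,2}, {1,3}, {0,4},     /* A->S, S->B1, S->B2, A->C  */
        {2,5}, {3,5}, {5,6}, {4,6}      /* B1->J, B2->J, J->D, C->D  */
    };
    int stage[NFILT] = { 0 };

    for (int k = 0; k < NEDGE; k++) {
        int i = e[k].src, j = e[k].dst;
        if (pe[i] == pe[j]) {
            if (stage[j] < stage[i])      /* same PE: Sj >= Si */
                stage[j] = stage[i];
        } else {
            if (stage[j] < stage[i] + 2)  /* cross PE: S_DMA = Si+1, Sj = S_DMA+1 */
                stage[j] = stage[i] + 2;
        }
    }
    for (int f = 0; f < NFILT; f++)
        printf("%-2s  PE%d  stage %d\n", name[f], pe[f], stage[f]);
    return 0;
}

With the example's PE assignment this reproduces the stages on the next slide: A, S, B1 at stage 0; C, B2, J at stage 2; D at stage 4; DMAs occupy the odd stages in between.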

Page 16: Orchestrating the Execution of Stream Programs on Multicore Platforms

Stage Assignment Example

[Figure: the fissed example graph (A, S, B1, B2, C, J, D) mapped to PE 0 = {A, S, B1, D} and PE 1 = {C, B2, J}; stage 0 = A, S, B1; stage 1 = DMAs; stage 2 = C, B2, J; stage 3 = DMA; stage 4 = D]

Page 17: Orchestrating the Execution of Stream Programs on Multicore Platforms

SGMS Phases

• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation

Page 18: Orchestrating the Execution of Stream Programs on Multicore Platforms

Code Generation for Cell

• Target the Synergistic Processing Elements (SPEs)
  – PS3 – up to 6 SPEs
  – QS20 – up to 16 SPEs
• One thread / SPE
• Challenge
  – Making a collection of independent threads implement a software pipeline
  – Adapt the kernel-only code schema of a modulo schedule

Page 19: Orchestrating the Execution of Stream Programs on Multicore Platforms

Complete Example

void spe1_work()
{
  char stage[5] = {0};
  stage[0] = 1;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }
    if (stage[1]) { }
    if (stage[2]) { JtoD(); CtoD(); }
    if (stage[3]) { }
    if (stage[4]) { D(); }
    barrier();
  }
}

(The stage-shifting and DMA-wait logic is elided here; the full SPE code template in the backup slides shows it.)

[Figure: resulting execution trace over time on SPE1/DMA1/SPE2/DMA2 — each iteration SPE1 runs A, S, B1 while DMA1 carries AtoC, StoB2, B1toJ; SPE2 runs C, B2, J while DMA2 carries CtoD, JtoD; D starts on SPE1 once its stage becomes active]

Page 20: Orchestrating the Execution of Stream Programs on Multicore Platforms

Experiments

• StreamIt benchmarks
  – Signal processing – DCT, FFT, FMRadio, radar
  – MPEG2 decoder subset
  – Encryption – DES
  – Parallel sort – Bitonic
  – Range of 30 to 90 filters/benchmark
• Platform
  – QS20 blade server – 2 Cell chips, 16 SPEs
• Software
  – StreamIt to C – gcc 4.1
  – IBM Cell SDK 2.1

Page 21: Orchestrating the Execution of Stream Programs on Multicore Platforms

Evaluation on QS20

[Chart: relative speedup (0–18) on 2, 4, 8, and 16 SPEs for bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar, and the geomean]

• Split/join overhead reduces the benefit from fission
• Barrier synchronization (1 per iteration of the stream graph)

Page 22: Orchestrating the Execution of Stream Programs on Multicore Platforms

SGMS (ILP) vs. Greedy

[Chart: relative speedup (0–9) per benchmark for ILP partitioning, greedy partitioning (MIT method, ASPLOS ’06), and exposed DMA]

• Solver time < 30 seconds for 16 processors

Page 23: Orchestrating the Execution of Stream Programs on Multicore Platforms

Conclusions

• Streamroller
  – Efficient mapping of stream programs to multicore
  – Coarse grain software pipelining
• Performance summary
  – 14.7x speedup on 16 cores
  – Up to 35% better than greedy solution (11% on average)
• Scheduling framework
  – Trade off memory space vs. load balance
    • Memory constrained (embedded) systems
    • Cache based systems

Page 24: Orchestrating the Execution of Stream Programs on Multicore Platforms


Page 25: Orchestrating the Execution of Stream Programs on Multicore Platforms


Page 26: Orchestrating the Execution of Stream Programs on Multicore Platforms

Naïve Unfolding

[Figure: the stream graph (A–F) replicated across PE1–PE4. For a completely stateless stream program each PE runs an independent copy; with stateful nodes, state data dependences (in addition to stream data dependences) link consecutive copies, and DMA transfers connect the PEs]

Page 27: Orchestrating the Execution of Stream Programs on Multicore Platforms

SGMS (ILP) vs. Greedy

[Chart: relative speedup (0–9) per benchmark for ILP partitioning vs. greedy partitioning (MIT method, ASPLOS ’06)]

• Solver time < 30 seconds for 16 processors
• Highly dependent on graph structure
• DMA overlapped in both ILP and Greedy
• Speedup drops with exposed DMA

Page 28: Orchestrating the Execution of Stream Programs on Multicore Platforms

Unfolding vs. SGMS

[Chart: relative speedup (0–18) per benchmark for naïve unfolding vs. SGMS]

Page 29: Orchestrating the Execution of Stream Programs on Multicore Platforms

Pros and Cons of Unfolding

• Pros
  – Simple
  – Good speedups for mostly stateless programs
• Cons
  – All input data needs to be available for an iteration to begin
  – Long latency for one iteration
  – Not suitable for real-time scheduling
  – Speedup affected by filters with large state

[Figure: unfolded execution of the A–D graph across PE 1–3, with sync + DMA between consecutive copies]

Page 30: Orchestrating the Execution of Stream Programs on Multicore Platforms

SGMS Assumptions

• Data independent filter work functions
  – Static schedule
• High bandwidth interconnect
  – EIB: 96 bytes/cycle
  – Low overhead DMA, low observed latency
• Cores can run MIMD efficiently
  – Not (directly) suitable for GPUs

Page 31: Orchestrating the Execution of Stream Programs on Multicore Platforms

Speedup Upperbounds

[Chart: relative speedup upper bounds (0–50) per benchmark for 2, 4, 8, 16, 32, and 64 processors. Example callout: 15% of the work in one stateless filter ⇒ max speedup = 100/15 = 6.7]

Page 32: Orchestrating the Execution of Stream Programs on Multicore Platforms

Summary

• 14.7x speedup on 16 cores
  – Preprocessing steps (fission) necessary
  – Hiding communication important
• Extensions
  – Scaling to more cores
  – Constrained memory systems
• Other targets
  – SODA, FPGA, GPU

Page 33: Orchestrating the Execution of Stream Programs on Multicore Platforms

Stream Programming

• Algorithmic style suitable for
  – Audio/video processing
  – Signal processing
  – Networking (packet processing, encryption/decryption)
  – Wireless domain, software defined radio
• Characteristics
  – Independent filters
    • Segregated address space
    • Independent threads of control
  – Explicit communication
  – Coarse grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS ’02, ’06, PLDI ’03, ’08]

[Figure: MPEG decoder stream graph — bitstream parser, motion decode, zigzag unorder, inverse quant AC/DC, 1D-DCT (row), 1D-DCT (column), bounded saturate, saturate]

Page 34: Orchestrating the Execution of Stream Programs on Multicore Platforms

Effect of Exposed DMA

[Chart: relative speedup (0–18) per benchmark for SGMS vs. SGMS with exposed DMA]

Page 35: Orchestrating the Execution of Stream Programs on Multicore Platforms

Scheduling via Unfolding Evaluation

• IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 blade server
• StreamIt benchmark characteristics:

  Benchmark    Total filters  Stateful  Peeking  State size (bytes)
  bitonic           40            1        0            4
  channel           55            1       34          252
  dct               40            1        0            4
  des               55            0        0            4
  fft               17            1        0            4
  filterbank        85            1       32          508
  fmradio           43            1       14          508
  tde               55            1        0            4
  mpeg2             39            1        0            4
  vocoder           54           11        6          112
  radar             57           44        0         1032

Page 36: Orchestrating the Execution of Stream Programs on Multicore Platforms

Absolute Performance

  Benchmark    GFLOPS with 16 SPEs
  bitonic      N/A
  channel      18.72949
  dct          17.2353
  des          N/A
  fft          1.65881
  filterbank   18.41478
  fmradio      17.05445
  tde          6.486601
  mpeg2        N/A
  vocoder      5.488236
  radar        30.29937

Page 37: Orchestrating the Execution of Stream Programs on Multicore Platforms

ILP constraints selecting a fission configuration for one actor. The actor appears in its original form (node 0_0), a 2x-fissed form (nodes 1_0–1_3), and a 3x-fissed form (nodes 2_0–2_4); $a_{1,k,j,i}$ places node $j$ of configuration $k$ on PE $i$, $b_{1,k}$ selects configuration $k$, and $M$ is a big-M constant.

Original actor (node 0_0):
$$\sum_{i=1}^{P} a_{1,0,0,i} - b_{1,0} = 0$$

Fissed 2x (nodes 1_0 – 1_3):
$$\sum_{i=1}^{P}\sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} \le M\, b_{1,1}$$
$$\sum_{i=1}^{P}\sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} - 2 \ge -M + M\, b_{1,1}$$

Fissed 3x (nodes 2_0 – 2_4):
$$\sum_{i=1}^{P}\sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} \le M\, b_{1,2}$$
$$\sum_{i=1}^{P}\sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} - 3 \ge -M + M\, b_{1,2}$$

Exactly one configuration is chosen:
$$b_{1,0} + b_{1,1} + b_{1,2} = 1$$

Page 38: Orchestrating the Execution of Stream Programs on Multicore Platforms

SPE Code Template

void spe_work()
{
  char stage[N] = {0, 0, ..., 0};   /* bit mask controls which stages execute */
  stage[0] = 1;                     /* activate stage 0 */

  for (i = 0; i < max_iter + N - 1; i++) {   /* go through all input items */
    if (stage[N-1]) { ... }
    if (stage[N-2]) { ... }
    ...
    if (stage[0]) {
      Start_DMA();                  /* start DMA operation */
      FilterA_work();               /* call filter work function */
    }
    if (i == max_iter - 1)
      stage[0] = 0;                 /* no more input: begin the epilogue */
    for (j = N-1; j >= 1; j--)
      stage[j] = stage[j-1];        /* left shift bit mask (activate more stages) */
    wait_for_dma();                 /* poll for completion of all outstanding DMAs */
    barrier();                      /* barrier synchronization, once per iteration */
  }
}

The first N-1 iterations form the prologue (flags for the later stages are still 0), and the last N-1 form the epilogue (stage 0 is cleared once all max_iter input items have been issued).