1 University of Michigan, Electrical Engineering and Computer Science
Orchestrating the Execution of Stream Programs on Multicore Platforms
Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab.
University of Michigan
2 University of Michigan, Electrical Engineering and Computer Science
Cores are the New Gates
[Chart: cores per chip (1 to 512, log scale) vs. year (1975 to 2010). Processors plotted include 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, Core, CoreDuo, Core2Duo, Core2Quad, Xeon, Power6, Xbox 360, BCM 1480, Opteron 4P, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, and NVIDIA G80, grouped into unicore, homogeneous multicore, and heterogeneous multicore eras. (Shekhar Borkar, Intel; courtesy Gordon'06)]
[Figure: programming models converging on stream programming: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind. (Courtesy Gordon'06)]
3 University of Michigan, Electrical Engineering and Computer Science
Stream Programming
• Programming style
  – Embedded domain: audio/video (H.264), wireless (WCDMA)
  – Mainstream: continuous query processing (IBM SystemS), search (Google Sawzall)
• Stream
  – Collection of data records
• Kernels/Filters
  – Functions applied to streams
  – Input/output are streams
  – Coarse-grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS'02, '06, PLDI '03]
4 University of Michigan, Electrical Engineering and Computer Science
Compiling Stream Programs
[Figure: a stream program graph mapped onto a multicore system of Core 1 through Core 4, each with its own memory.]
• Heavy lifting
• Equal work distribution
• Communication
• Synchronization
5 University of Michigan, Electrical Engineering and Computer Science
Stream Graph Modulo Scheduling (SGMS)
• Coarse-grain software pipelining
  – Equal work distribution
  – Communication/computation overlap
• Target: Cell processor
  – Cores with disjoint address spaces
  – Explicit copy to access remote data
    • DMA engine independent of PEs
• Filters = operations, cores = function units
[Figure: Cell block diagram. SPE0 through SPE7, each an SPU with a 256 KB local store and an MFC (DMA), connected over the EIB to the PPE (PowerPC) and DRAM.]
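Since SPEs have disjoint address spaces, a filter cannot dereference remote data; its inputs must be copied into the local store before it runs. Below is a minimal C sketch of that pattern. The helpers dma_get/dma_put/dma_wait are hypothetical stand-ins for the Cell SDK's mfc_get/mfc_put-style intrinsics, and CHUNK is an illustrative buffer size.

#include <stddef.h>

/* Hypothetical DMA helpers standing in for the MFC intrinsics. */
extern void dma_get(void *dst, unsigned long remote_src, size_t n);
extern void dma_put(const void *src, unsigned long remote_dst, size_t n);
extern void dma_wait(void);

#define CHUNK 1024

static int in_buf[CHUNK];    /* buffers live in the 256 KB local store */
static int out_buf[CHUNK];

/* One filter firing: explicit copy-in, compute, explicit copy-out. */
void run_filter_once(unsigned long remote_in, unsigned long remote_out)
{
    dma_get(in_buf, remote_in, sizeof in_buf);
    dma_wait();                                    /* data is now local */
    for (int i = 0; i < CHUNK; i++)
        out_buf[i] = 2 * in_buf[i];                /* the filter's work */
    dma_put(out_buf, remote_out, sizeof out_buf);
    dma_wait();
}

Because the MFC runs independently of the SPU, the two dma_wait() calls can be pushed apart so the next transfer overlaps with computation, which is exactly what SGMS's stage assignment arranges.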
6 University of Michigan, Electrical Engineering and Computer Science
Streamroller Software Overview
[Figure: the Streamroller toolchain. Front ends (StreamIt; SPEX, C++ with stream extensions; Stylized C) feed a Stream IR, which passes through Profiling, Analysis, Scheduling, and Code Generation. Targets: a custom application engine, SODA (a low-power multicore processor for SDR), and general multicores (Cell, Core2Quad, Niagara). Focus of this talk: the scheduling path to multicore targets.]
7 University of Michigan, Electrical Engineering and Computer Science
Preliminaries
• Synchronous Data Flow (SDF) [Lee '87]
• StreamIt [Thies '02]
• Filters push and pop items from input/output FIFOs:

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

• As written, FIR is stateless. A variant that carries mutable state across firings, e.g.

int wgts[N];
wgts = adapt(wgts);

is stateful.
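To make the pop/peek/push semantics concrete, here is a small self-contained C sketch of a stateless FIR-style filter over a flat input array; the layout and names (fir_step, the arrays in main) are illustrative, not StreamIt's actual runtime.

#include <stdio.h>

#define TAPS 4

/* One firing: peek TAPS items, push one sum, pop one item. */
static int fir_step(const int *in, const int wgts[TAPS])
{
    int sum = 0;
    for (int i = 0; i < TAPS; i++)
        sum += in[i] * wgts[i];    /* peek(i) * wgts[i] */
    return sum;                    /* push(sum); caller advances by 1 (pop) */
}

int main(void)
{
    int input[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const int wgts[TAPS] = {1, 1, 1, 1};

    /* pop 1 / push 1 per firing, but each firing peeks TAPS items,
       so the filter can fire only while TAPS items remain */
    for (int head = 0; head + TAPS <= 8; head++)
        printf("%d ", fir_step(&input[head], wgts));  /* 10 14 18 22 26 */
    printf("\n");
    return 0;
}

Because the work function writes no state that survives across firings, this filter is stateless and can be fissed; the adapt(wgts) variant above carries wgts across firings and cannot.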
8 University of Michigan, Electrical Engineering and Computer Science
SGMS Overview
[Figure: a stream graph is partitioned across PE0 through PE3 so per-PE work is balanced, DMA operations are inserted on cross-PE edges, and execution is software-pipelined with a prologue to fill the pipeline and an epilogue to drain it. The annotation T1/T4 ≈ 4 indicates near-ideal speedup on four PEs.]
9 University of Michigan, Electrical Engineering and Computer Science
SGMS Phases
• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation
10 University of Michigan, Electrical Engineering and Computer Science
Processor Assignment
• Assign filters to processors
  – Goal: equal work distribution
• Graph partitioning? Bin packing? (a greedy baseline is sketched below)
[Figure: original stream program, with A (work 5) feeding B (40) and C (10), which merge into D (5); total work = 60. Best 2-PE assignment: B on PE0; A, C, D on PE1. Speedup = 60/40 = 1.5. Modified stream program: B fissed into B1 and B2 with split S and join J. Assignment: B2, C, J on PE0; A, S, B1, D on PE1. Speedup = 60/32 ≈ 2.]
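The slide's "bin packing?" question can be made concrete with a greedy longest-processing-time heuristic: sort filters by work, then place each on the least-loaded PE. A minimal C sketch using the A/B/C/D weights above (5, 40, 10, 5); note that it stalls at the 60/40 split until B is fissed, which is why fission is integrated into the assignment on the next slides.

#include <stdio.h>
#include <stdlib.h>

#define NFILTERS 4
#define NPES     2

static int cmp_desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;   /* heaviest first */
}

int main(void)
{
    int work[NFILTERS] = {5, 40, 10, 5};        /* A, B, C, D */
    int load[NPES] = {0};

    qsort(work, NFILTERS, sizeof work[0], cmp_desc);
    for (int f = 0; f < NFILTERS; f++) {
        int best = 0;                           /* least-loaded PE so far */
        for (int p = 1; p < NPES; p++)
            if (load[p] < load[best]) best = p;
        load[best] += work[f];
    }
    for (int p = 0; p < NPES; p++)
        printf("PE%d load = %d\n", p, load[p]); /* 40 and 20: II stuck at 40 */
    return 0;
}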
11 University of Michigan, Electrical Engineering and Computer Science
Filter Fission Choices
[Figure: alternative ways to fiss a filter and distribute the pieces across PE0 through PE3. Speedup ≈ 4?]
12 University of Michigan, Electrical Engineering and Computer Science
Integrated Fission + PE Assignment
• Exact solution based on Integer Linear Programming (ILP)
• Split/join overhead factored in
• Objective function
  – Minimize the maximal load on any PE
• Result
  – Number of times to "split" each filter
  – Filter → processor mapping
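A sketch of the load-balance core of such a formulation, in assumed notation ($w_f$ = work of filter $f$, $a_{f,p}$ = 0/1 variable assigning $f$ to PE $p$, $T$ = maximal PE load); the actual formulation additionally encodes the fission degree of each filter and the split/join overhead (see the constraint slide in the backup):

\begin{align*}
\min\; T \quad \text{subject to} \quad
  \sum_{f} w_f\, a_{f,p} &\le T && \forall\, p,\\
  \sum_{p} a_{f,p} &= 1 && \forall\, f,\\
  a_{f,p} &\in \{0, 1\}.
\end{align*}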
13 University of Michigan, Electrical Engineering and Computer Science
SGMS Phases
• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation
14 University of Michigan, Electrical Engineering and Computer Science
Forming the Software Pipeline
• To achieve speedup
  – All chunks should execute concurrently
  – Communication should be overlapped
• Processor assignment alone is insufficient information
[Figure: filters A->B->C, with A on PE0 and B, C on PE1. Timelines: running Ai and Bi back-to-back leaves the A->B transfer on the critical path (marked X); giving the A->B DMA its own slot and overlapping A_{i+2} with B_i hides the communication.]
15 University of Michigan, Electrical Engineering and Computer Science
Stage Assignment
• Preserve causality (producer-consumer dependence): for an edge from filter i to filter j within one PE, assign S_j ≥ S_i.
• Communication-computation overlap: for an edge from filter i on PE 1 to filter j on PE 2, the DMA gets its own stage with S_DMA > S_i, and the consumer gets S_j = S_DMA + 1.
• Data-flow traversal of the stream graph
  – Assign stages using the above two rules (a sketch in code follows)
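A minimal C sketch of that traversal, assuming filters are numbered in topological order and edges are visited in that order; the arrays and names are illustrative.

#include <stdio.h>

struct edge { int src, dst; };

/* Apply the two stage-assignment rules along each edge. */
static void assign_stages(const struct edge *e, int ne,
                          const int *pe, int *stage /* zero-initialized */)
{
    for (int k = 0; k < ne; k++) {
        int i = e[k].src, j = e[k].dst;
        if (pe[i] == pe[j]) {
            if (stage[j] < stage[i])
                stage[j] = stage[i];        /* causality: S_j >= S_i */
        } else {
            int s_dma = stage[i] + 1;       /* DMA stage: S_DMA > S_i */
            if (stage[j] < s_dma + 1)
                stage[j] = s_dma + 1;       /* overlap: S_j = S_DMA + 1 */
        }
    }
}

int main(void)
{
    /* A(0) -> B(1) -> C(2), with A on PE0 and B, C on PE1 */
    struct edge e[] = { {0, 1}, {1, 2} };
    int pe[3] = {0, 1, 1};
    int stage[3] = {0};

    assign_stages(e, 2, pe, stage);
    for (int n = 0; n < 3; n++)
        printf("filter %d -> stage %d\n", n, stage[n]);  /* 0, 2, 2 */
    return 0;
}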
16 University of Michigan, Electrical Engineering and Computer Science
Stage Assignment Example
[Figure: the fissed example graph (A, S, B1, B2, C, J, D) mapped to PE0 and PE1. Stage 0: A, S, B1. Stage 1: three DMAs (A to C, S to B2, B1 to J). Stage 2: B2, C, J. Stage 3: DMA (C to D, J to D). Stage 4: D.]
17 University of Michigan, Electrical Engineering and Computer Science
SGMS Phases
• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation
18 University of Michigan, Electrical Engineering and Computer Science
Code Generation for Cell
• Target the Synergistic Processing Elements (SPEs)
  – PS3: up to 6 SPEs
  – QS20: up to 16 SPEs
• One thread per SPE
• Challenge
  – Making a collection of independent threads implement a software pipeline
  – Adapt the kernel-only code schema of a modulo schedule
19 University of Michigan, Electrical Engineering and Computer Science
Complete Example

void spe1_work()
{
  char stage[5] = {0};
  int i;
  stage[0] = 1;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }
    if (stage[1]) { }
    if (stage[2]) { JtoD(); CtoD(); }
    if (stage[3]) { }
    if (stage[4]) { D(); }
    barrier();
  }
}

(The stage-mask shifting that progressively activates stages 1 through 4 is elided here; the full schema appears in the SPE code template in the backup slides.)

[Figure: time-lapse of SPE1, DMA1, SPE2, and DMA2. Each iteration SPE1 runs A, S, B1; DMA1 moves AtoC, StoB2, B1toJ; SPE2 runs B2, J, C; DMA2 moves CtoD and JtoD; SPE1 then runs D. Successive iterations overlap until the pipeline reaches steady state.]
20 University of Michigan, Electrical Engineering and Computer Science
Experiments
• StreamIt benchmarks
  – Signal processing: DCT, FFT, FMRadio, radar
  – MPEG2 decoder subset
  – Encryption: DES
  – Parallel sort: bitonic
  – Range of 30 to 90 filters per benchmark
• Platform
  – QS20 blade server: 2 Cell chips, 16 SPEs
• Software
  – StreamIt to C, gcc 4.1
  – IBM Cell SDK 2.1
21 University of Michigan, Electrical Engineering and Computer Science
Evaluation on QS20
[Chart: relative speedup (0 to 18) on 2P, 4P, 8P, and 16P for bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar, and the geometric mean.]
• Split/join overhead reduces the benefit from fission
• Barrier synchronization (1 per iteration of the stream graph)
22 University of Michigan, Electrical Engineering and Computer Science
SGMS (ILP) vs. Greedy
[Chart: relative speedup (0 to 9) per benchmark for ILP partitioning, greedy partitioning (MIT method, ASPLOS'06), and exposed DMA.]
• Solver time < 30 seconds for 16 processors
23 University of Michigan, Electrical Engineering and Computer Science
Conclusions
• Streamroller
  – Efficient mapping of stream programs to multicore
  – Coarse-grain software pipelining
• Performance summary
  – 14.7x speedup on 16 cores
  – Up to 35% better than the greedy solution (11% on average)
• Scheduling framework
  – Trade off memory space vs. load balance
    • Memory-constrained (embedded) systems
    • Cache-based systems
24 University of Michigan, Electrical Engineering and Computer Science
25 University of Michigan, Electrical Engineering and Computer Science
26 University of Michigan, Electrical Engineering and Computer Science
Naïve Unfolding
[Figure: a six-filter stream graph (A; B, C; D, E; F) replicated onto PE1 through PE4. Top: a completely stateless stream program, where the four copies are linked only by stream data dependences and run independently. Bottom: the same program with stateful nodes, where state data dependences chain corresponding filters across PEs and DMA transfers are required between the copies.]
27 University of Michigan, Electrical Engineering and Computer Science
SGMS (ILP) vs. Greedy
[Chart: relative speedup (0 to 9) per benchmark for ILP partitioning vs. greedy partitioning (MIT method, ASPLOS'06).]
• Solver time < 30 seconds for 16 processors
• Results highly dependent on graph structure
• DMA overlapped in both ILP and Greedy
• Speedup drops with exposed DMA
28 University of Michigan, Electrical Engineering and Computer Science
Unfolding vs. SGMS
[Chart: relative speedup (0 to 18) per benchmark for naïve unfolding vs. SGMS.]
29 University of Michigan, Electrical Engineering and Computer Science
Pros and Cons of Unfolding
• Pros
  – Simple
  – Good speedups for mostly stateless programs
• Cons
  – All input data must be available before an iteration can begin
  – Long latency for one iteration
  – Not suitable for real-time scheduling
  – Speedup limited by filters with large state
[Figure: unfolded copies of the graph A, B, C, D on PE1 through PE3, separated by Sync + DMA steps between iterations.]
30 University of Michigan, Electrical Engineering and Computer Science
SGMS Assumptions
• Data-independent filter work functions
  – Static schedule
• High-bandwidth interconnect
  – EIB: 96 bytes/cycle
  – Low-overhead DMA, low observed latency
• Cores can run MIMD efficiently
  – Not (directly) suitable for GPUs
31 University of Michigan, Electrical Engineering and Computer Science
Speedup Upper Bounds
[Chart: relative speedup upper bound (0 to 50) per benchmark for 2P, 4P, 8P, 16P, 32P, and 64P.]
• Example: with 15% of the work in one stateless filter, max speedup = 100/15 = 6.7
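The annotation is an instance of a simple bound: if one filter that is not (or cannot be) fissed holds a fraction $f$ of the total work, the steady-state throughput of any pipeline is capped by that filter, so

\[ \text{speedup} \le \frac{1}{f}, \qquad f = 0.15 \;\Rightarrow\; \text{speedup} \le \frac{100}{15} \approx 6.7, \]

regardless of how many PEs are added.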
32 University of Michigan, Electrical Engineering and Computer Science
Summary
• 14.7x speedup on 16 cores
  – Preprocessing steps (fission) necessary
  – Hiding communication important
• Extensions
  – Scaling to more cores
  – Constrained memory systems
• Other targets
  – SODA, FPGA, GPU
33 University of Michigan, Electrical Engineering and Computer Science
Stream Programming
• Algorithmic style suitable for
  – Audio/video processing
  – Signal processing
  – Networking (packet processing, encryption/decryption)
  – Wireless domain, software-defined radio
• Characteristics
  – Independent filters
    • Segregated address space
    • Independent threads of control
  – Explicit communication
  – Coarse-grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS'02, '06, PLDI '03, '08]
[Figure: MPEG decoder stream graph with filters bitstream parser, zigzag unorder, inverse quant AC, inverse quant DC, saturate, 1D-DCT (row), bounded saturate, 1D-DCT (column), and motion decode.]
34 University of Michigan, Electrical Engineering and Computer Science
Effect of Exposed DMA
[Chart: relative speedup (0 to 18) per benchmark for SGMS vs. SGMS with exposed DMA.]
35 University of Michigan, Electrical Engineering and Computer Science
Scheduling via Unfolding: Evaluation
• IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 blade server
• StreamIt benchmark characteristics:

Benchmark    Total filters  Stateful  Peeking  State size (bytes)
bitonic      40             1         0        4
channel      55             1         34       252
dct          40             1         0        4
des          55             0         0        4
fft          17             1         0        4
filterbank   85             1         32       508
fmradio      43             1         14       508
tde          55             1         0        4
mpeg2        39             1         0        4
vocoder      54             11        6        112
radar        57             44        0        1032
36 University of Michigan, Electrical Engineering and Computer Science
Absolute Performance

Benchmark    GFLOPS with 16 SPEs
bitonic      N/A
channel      18.73
dct          17.24
des          N/A
fft          1.66
filterbank   18.41
fmradio      17.05
tde          6.49
mpeg2        N/A
vocoder      5.49
radar        30.30
37 University of Michigan, Electrical Engineering and Computer Science
ILP fission constraints (backup). For each actor, exactly one fission variant is selected: b_{1,0} keeps the original actor (node 0_0), b_{1,1} fisses it 2x (nodes 1_0 to 1_3, including split/join), and b_{1,2} fisses it 3x (nodes 2_0 to 2_4). Here a_{1,k,j,i} is the 0/1 assignment of node j of variant k to PE i, P is the number of PEs, and M is a big-M constant:

Original actor:
\[ \sum_{i=1}^{P} a_{1,0,0,i} - b_{1,0} = 0 \]

Fissed 2x:
\[ \sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} \le M b_{1,1} \]
\[ \sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} - 2 \ge -M + M b_{1,1} \]

Fissed 3x:
\[ \sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} \le M b_{1,2} \]
\[ \sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} - 3 \ge -M + M b_{1,2} \]

Exactly one variant is chosen:
\[ b_{1,0} + b_{1,1} + b_{1,2} = 1 \]
38 University of Michigan, Electrical Engineering and Computer Science
SPE Code Template

void spe_work()
{
  char stage[N] = {0};            /* bit mask controls what gets executed */
  int i, j;
  stage[0] = 1;                   /* activate stage 0 */
  /* go through all input items */
  for (i = 0; i < max_iter + N - 1; i++) {
    if (stage[N-1]) { ... }
    if (stage[N-2]) { ... }
    ...
    if (stage[0]) {
      Start_DMA();                /* start DMA operation */
      FilterA_work();             /* call filter work function */
    }
    if (i == max_iter - 1)
      stage[0] = 0;               /* last input item seen: stop stage 0 */
    for (j = N-1; j >= 1; j--)
      stage[j] = stage[j-1];      /* left-shift bit mask (activate more stages) */
    wait_for_dma();               /* poll for completion of all outstanding DMAs */
    barrier();                    /* barrier synchronization */
  }
}
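After the first N-1 iterations the stage mask is all ones and every stage fires in every iteration: the steady state of the modulo schedule. Once i reaches max_iter-1, stage[0] is cleared and zeros shift in, so the last N-1 iterations drain the pipeline, mirroring the prologue/epilogue structure shown in the SGMS overview.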