Flexible Filters for High Performance Embedded Computing
Rebecca Collins and Luca Carloni
Department of Computer Science
Columbia University
Motivation
Stream Programming
• Stream Programming model
– filter: a piece of sequential programming
– channels: how filters communicate
– token: an indivisible unit of data for a filter
• Examples: Signal processing, image processing, embedded applications
[Figure: filters A, B, and C connected by channels]
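The filter/channel/token vocabulary above can be made concrete with a minimal sketch. Everything here is illustrative and not taken from the slides: the `Channel` class, the `run_filter` helper, and the three lambda bodies standing in for filters A, B, and C are all hypothetical.

```python
from collections import deque


class Channel:
    """FIFO channel that carries tokens from one filter to the next."""

    def __init__(self):
        self._queue = deque()

    def push(self, token):
        self._queue.append(token)

    def pop(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


def run_filter(fn, inp, out):
    """Run a filter: apply the sequential function fn to each waiting token."""
    while len(inp):
        out.push(fn(inp.pop()))


# Hypothetical pipeline A -> B -> C; the filter bodies are placeholders.
a_in, a_to_b, b_to_c, c_out = Channel(), Channel(), Channel(), Channel()
for token in [1, 2, 3]:
    a_in.push(token)

run_filter(lambda x: x + 1, a_in, a_to_b)    # filter A
run_filter(lambda x: x * x, a_to_b, b_to_c)  # filter B
run_filter(lambda x: -x, b_to_c, c_out)      # filter C

result = [c_out.pop() for _ in range(len(c_out))]
```

Here each filter drains its input completely before the next one starts; on a real multi-core mapping the filters would run concurrently and the channels would be bounded.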
Example: Constant False-Alarm Rate (CFAR) Detection
[Figure: guard cells and cell under test]
Comparison with additional factors:
• number of gates
• threshold (μ)
• number of guard cells
• rows and other dimensional data
HPEC Challenge: http://www.ll.mit.edu/HPECchallenge/
CFAR Benchmark
[Figure, animated over several slides: a data stream of cells 1 to 15 enters the filters uInt to float, square, right window, left window, add, align data, and find targets. The final frame marks the cell under test (t) with its left window and right window.]
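To make the benchmark's filter chain easier to follow, here is a rough cell-averaging CFAR sketch. It is not the HPEC kernel: the function name `cfar_targets`, its parameter names, the default values, and the exact normalization are assumptions chosen for illustration. Only the overall shape (square, left and right windows, add, compare against a threshold μ) follows the slides.

```python
def cfar_targets(cells, n_window=2, n_guard=1, mu=4.0):
    """Return indices of cells whose power exceeds mu times the local noise.

    n_window: cells per side used for the noise estimate (assumed name)
    n_guard:  guard cells per side skipped around the cell under test
    mu:       detection threshold
    """
    power = [float(c) ** 2 for c in cells]  # "uInt to float", then "square"
    targets = []
    for t in range(len(power)):
        # Left and right windows sit just outside the guard cells.
        left = power[max(0, t - n_guard - n_window):max(0, t - n_guard)]
        right = power[t + n_guard + 1:t + n_guard + 1 + n_window]
        neighbours = left + right  # "add" of the two windows
        if not neighbours:
            continue
        noise = sum(neighbours) / len(neighbours)
        if power[t] > mu * noise:  # "find targets"
            targets.append(t)
    return targets


# Fifteen cells with one strong return at index 7 (0-based).
cells = [1] * 15
cells[7] = 10
detected = cfar_targets(cells)
```

The extra work done only when a cell passes the threshold is what makes the execution time of the last filter depend on the data.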
Mapping
Throughput: rate of processing data tokens
[Figure: multi-core chip with filter A on core 1, B on core 2, and C on core 3]
[Figure: multi-core chip with filters A and B sharing core 1 and C on core 3]
Unbalanced Flow
• Bottlenecks reduce throughput
– Caused by backpressure
• inherent algorithmic imbalances
• data-dependent computational spikes
[Figure: filters A, B, and C, with one filter marked "wait"]
Data Dependent Execution Time: CFAR
• Using Set 1 from the HPEC CFAR Kernel Benchmark
• Over a block of about 100 cells
• Extra workload of 32 microseconds per target
[Figure: three histograms of execution time (microseconds) against percent, one each for 7.3 % targets, 1.3 % targets, and 0.3 % targets]
Data Dependent Execution Time
Some other examples:
• Bloom Filters (Financial, Spam detection)
• Compression (Image processing)
[Figure: histogram of execution time (s) against percent]
Our Solution: Flexible Filters
• Unused cycles on B are filled by working ahead on filter C with the data already present on B
• Push stream flow upstream of a bottleneck
• Semantic Preservation
[Figure: filters A, B, and C on cores 1, 2, and 3, with a second copy of C placed between a flex-split and a flex-merge]
Related Works
• Static Compiler Optimizations
– StreamIt [Gordon et al., 2006]
• Dynamic Runtime Load Balancing
– Work Stealing/Filter Migration
• [Kakulavarapu et al., 2001]
• Cilk [Frigo et al., 1998]
• Flux [Shah et al., 2003]
• Borealis [Xing et al., 2005]
– Queue based load balancing
• Diamond [Huston et al., 2005]: distributed search, queue based load balancing, filter re-ordering
• Combination Static+Dynamic
– FlexStream [Hormati et al., 2009]: multiple competing programs
Outline
• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
– CFAR Case Study
Design Flow of a Stream Program with Flexible Filters
[Figure: design flow supported by compiler/design tools, with the stages Design stream algorithm, Mapping (Filters, Memory), Add Flexibility to Bottlenecks, and Profile]
Profile of CFAR filters on Cell
[Figure: bar chart of average execution time (microseconds) per 100 tokens, on an axis from 0 to 5, for the filters find targets, add, right window, and uIntToFloat]
Design Considerations
[Figure: the design flow diagram repeated, with the stages Design stream algorithm, Mapping (Filters, Memory), Add Flexibility to Bottlenecks, and Profile]
Adding Flexibility
[Figure: the design flow diagram repeated]
Mapping Stream Programs to Multi-Core Platforms
[Figure: three ways to map filters A, B, and C onto cores 1 to 3: pipeline mapping, SPMD mapping, and sharing a core]
Throughput: SPMD Mapping
Suppose E_A = 2, E_B = 2, E_C = 3
[Figure: SPMD mapping timeline over t0 to t6; core 1 runs A1, B1, C1, core 2 runs A2, B2, C2, and core 3 runs A3, B3, C3]
3 tokens processed in 7 timesteps, ideal throughput = 3/7 = 0.429
Throughput: Pipeline Mapping
[Figure: pipeline with A on core 1 (latency 2), B on core 2 (latency 2), and C on core 3 (latency 3); timeline over t0 to t9 showing A1 to A4, B1 to B3, and C1 to C2]
throughput = 1/3 = 0.333 < 0.429
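The two throughput figures quoted so far can be checked with a few lines of arithmetic. This is only a sketch of the slide's numbers: the execution times E_A = 2, E_B = 2, E_C = 3 come from the slides, while the variable names are my own.

```python
from fractions import Fraction

# Execution times per token, as given on the slides.
E = {"A": 2, "B": 2, "C": 3}
n_cores = 3

# SPMD mapping: every core runs A, B, and C back to back on its own token,
# so n_cores tokens finish every E_A + E_B + E_C timesteps.
spmd = Fraction(n_cores, sum(E.values()))

# Pipeline mapping: in steady state the slowest stage sets the rate.
pipeline = Fraction(1, max(E.values()))

print(f"SPMD:     {spmd} = {float(spmd):.3f}")
print(f"pipeline: {pipeline} = {float(pipeline):.3f}")
```

Using `Fraction` keeps the results exact (3/7 and 1/3) before they are rounded for display.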
Data Blocks
[Figure: filters A, B, and C on cores 1, 2, and 3, with a second copy of C]
data block: group of data tokens
Throughput: Pipeline Augmented with Flexibility
[Figure: filters A, B, and C on cores 1, 2, and 3, with a second copy of C; timeline over t0 to t9 showing A1 to A4, B1 to B3, C1, and C2, where C2 appears in two places]
throughput = 2/5 = 0.4 < 0.429 (but > 0.333)
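One way to arrive at the 2/5 figure is a simple load-balance bound. This is my reading, not something the slides spell out: it assumes that the work of B and C can be divided evenly between cores 2 and 3, and that splitting and merging cost nothing.

```python
from fractions import Fraction

# Execution times per token, as given on the slides.
E_A, E_B, E_C = 2, 2, 3

# Assumption: with a flexible copy of C, cores 2 and 3 share the combined
# work of B and C evenly, so each spends (E_B + E_C) / 2 per token.
shared_load = Fraction(E_B + E_C, 2)

# Core 1 still runs A alone; the busiest core sets the rate.
bottleneck = max(Fraction(E_A), shared_load)
flex = 1 / bottleneck

print(f"flexible pipeline: {flex} = {float(flex):.3f}")
```

Under these assumptions the result sits between the plain pipeline (1/3) and the SPMD ideal (3/7), matching the ordering stated on the slide.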
Flex-Split
Flex-split:
    pop data block b from in
    n0 = available space on out0
    n1 = |b| - n0
    send n0 to out0, n1 to out1
    send n0 0's, then n1 1's to select
maintain ordering based on run-time state of queues
[Figure: B and a copy of C on core 2, C on core 3; flex split has input in and outputs out0, out1, and select; flex merge has inputs in0, in1, and select, and output out]
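The flex-split steps can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the queues are plain `deque` objects, `out0_capacity` is a hypothetical parameter standing in for the bounded channel size, and `n0` is clamped to the block size so that `n1` never goes negative (the slide does not state this explicitly).

```python
from collections import deque


def flex_split(block, out0, out1, select, out0_capacity):
    """Split one data block between out0 and out1 and record the choice.

    block:         list of tokens popped from the input channel
    out0_capacity: assumed maximum number of tokens out0 may hold
    """
    space = max(0, out0_capacity - len(out0))  # available space on out0
    n0 = min(space, len(block))                # clamp: assumption, see above
    n1 = len(block) - n0
    out0.extend(block[:n0])                    # first n0 tokens go to out0
    out1.extend(block[n0:])                    # remaining n1 tokens to out1
    select.extend([0] * n0 + [1] * n1)         # n0 zeros, then n1 ones


# out0 already holds one token, so only three more fit.
out0, out1, select = deque(["x"]), deque(), deque()
flex_split([1, 2, 3, 4, 5], out0, out1, select, out0_capacity=4)
```

The `select` stream is what later lets a merge stage restore the original token order, regardless of how the block was divided.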
Flex-Merge
Flex-merge:
    pop i from select
    if i is 0, pop token from in0
    if i is 1, pop token from in1
    push token to out
[Figure: B and a copy of C on core 2, C on core 3; flex split has input in and outputs out0, out1, and select; flex merge has inputs in0, in1, and select, and output out]
Overhead of Flexibility?
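The flex-merge steps can be sketched the same way. Again this is a hedged illustration: the queues are plain `deque` objects, and instead of blocking on an empty input (which a real channel would presumably do) the loop peeks at `select` and stops early, which is my simplification rather than anything stated on the slide.

```python
from collections import deque


def flex_merge(in0, in1, select, out):
    """Reassemble tokens from in0 and in1 in the order recorded in select."""
    while select:
        i = select[0]                  # peek at the next routing decision
        source = in0 if i == 0 else in1
        if not source:
            break                      # token not produced yet; try again later
        select.popleft()               # pop i from select
        out.append(source.popleft())   # pop token from in0 or in1, push to out


# Tokens processed by the two copies of the filter, plus the routing record.
in0 = deque(["c1", "c2", "c3"])
in1 = deque(["c4", "c5"])
select = deque([0, 0, 0, 1, 1])
out = []
flex_merge(in0, in1, select, out)
```

Because the merge only ever follows `select`, the output order matches the input order of the split, which is the semantic preservation property claimed for flexible filters.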
Multi-Channel Flex-Split and Flex-Merge
[Figure: a filter and its flexible copy (filterflex) fed by a flex split with a select channel; each of the output channels 1 to n has its own flex merge]
[Figure: a filter with input channels 1 to n]
[Figure: Centralized: a filter and filterflex with input channels 1 to n, one flex split, one select channel, and flex merge]
[Figure: Distributed: a filter and filterflex with input channels 1 to n, several β flex splits, a select channel, and flex merge, shown next to the centralized version]
Cell BE Processor
• Distributed Memory
• Heterogeneous
– 8 SIMD (SPU) cores
– 1 PowerPC (PPU)
• Element Interconnect Bus
– 4 rings
– 205 Gb/s
• Gedae Programming Language
Communication Layer
Gedae
• Commercial data-flow language and programming GUI
• Performance analysis tools
CFAR Benchmark
[Figure: CFAR filters uInt to float, square, right window, left window, add, align data, and find targets]
Profile of CFAR filters on Cell
[Figure: bar chart of average execution time (microseconds) per 100 tokens, on an axis from 0 to 5, for the filters find targets, add, right window, and uIntToFloat]
Data Dependency
• By changing threshold, change % targets
– 1.3 %
– 7.3 %
• Additional workload per target
– 16 µs
– 32 µs
– 64 µs

Additional Workload    % Targets: 1.3    % Targets: 7.3
16 µs                  0.82              1.45
32 µs                  1.06              1.39
64 µs                  1.27              1.47
More Benchmarks

Benchmark        Field                 Results: configuration                   Speedup
Dedup            Information Theory    Rabin block/max chunk size 4096/512      2.00
JPEG             Image Processing      Image width x height 128x128             1.31
                                       Image width x height 256x256             1.16
                                       Image width x height 512x512             1.25
Value-at-Risk    Finance               stocks/walks/timesteps 16/1024/1024      0.98
                                       stocks/walks/timesteps 64/1024/1024      1.56
                                       stocks/walks/timesteps 128/1024/1024     1.55
Conclusions
• Flexible filters
– adapt to data dependent bottlenecks
– distributed load balancing
– provide speedup without modification to original filters
– can be implemented on top of general stream languages