Upload
lyngoc
View
225
Download
5
Embed Size (px)
Citation preview
Dataflow programming for heterogeneous computing systems
Jeronimo CastrillonCfaed Chair for Compiler ConstructionTU [email protected]
Tutorial: Algorithmic specification, tools and algorithms for programming heterogeneous platforms. PACT Conference. San Francisco, October 19th 2015
Outline
q Heterogeneous systems
q Dataflow models
q Analysis and synthesis
q Summary
2 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Outline
q Heterogeneous systems
q Dataflow models
q Analysis and synthesis
q Summary
3 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Heterogeneous computing systems
q Today’s heterogeneityq Desktop/HPC: GPGPU, GP + ACCq Embedded: RISCs, DSPs, ASIPs, …
è Resources: different performance/energy characteristics
q Tomorrow’s heterogeneity: Emerging technologiesq Heterogeneity beyond performance
and energy§ Reliability/error tolerance§ Computation model
4 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2
5
Heterogeneity: Cfaed Vision
q German Excellence Cluster: Goal – “to explore new technologies for electronic information processing which overcome the limits of CMOS technology”
Materials &Functions
Devices &Circuits
Information Processing
CMOS
(industry focus)
A Silicon Nanowires
B Carbon
C Organic
D BiomolecularAssembly
E Chemical
F Orchestration
G Resilience
HHAEC:
Highly Adaptive
ComputingEnergy-Efficient
IBiologicalSystems
Material research for post-CMOS technologies
Systems research to handle heterogeneity
Biology to inspire solutions
© J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Heterogeneity: Example SiNW
q SiNW: Silicon Nanowiresq Multi-gate devices with less performance penaltyq Reconfigurable P/N functionality
q Possibilitiesq New micro architectures
q New pipeline structuresq New field programmable devices
6 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
[Trommer15]
Heterogeneity: Example chemical processing
q Lab-on-Chips: for sensing, analysis and testq Also for computing?
q Different kinds of transistorsq Oscillators and other components
7 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Fundamentally different way of computation
[Voigt14]
Heterogeneity: Example DNA origami
q DNA origami: Self-assembled 2D/3D structures made of DNA strands
q Use structures to build advance electronic devicesq Example: Plasmonic waveguides
8 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Courtesy: Thorsten Lars-Schmidthttps://cfaed.tu-dresden.de/schmidt-home
9
Consequence of heterogeneity
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2
Already difficult to program them today, what about tomorrow?Need for models and abstraction
© J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Outline
q Heterogeneous systems
q Dataflow models
q Analysis and synthesis
q Summary
10 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Models: Introduction
q Von Neumann model makes things complicatedq Sharing stateq Data races
q Task graphs: A simple parallel programming modelq Intel TBB, .NET Task parallel library
(TPL), OpenMP Tasks, …q Runtime and data management, e.g.,
StarPU [Augonnet10]11 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Central Processing Unit (CPU)
Memory & I/O
Control unit Processing unitInst. reg
Prog. counterReg. file
ALU
Directed acyclic task graphs
q Very well studied, see for example [Kwok99]q Difficult problem for heterogeneous systems
12 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
RISC DSP
SharedMemory
Interconnect
Time
Processor
13
24
Time
Processor
14
32
?4
2
1
3
Dataflow models
q Also a graph representation: Nodes & edges are called actors & channelsq Implicit repetition, common in streaming, signal processing applicationsq Communication: only through channelsq Multiple flavors of models: Rules that determine when an actor firesq A graph models multiple possible executions:
13 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
4
2
1
3Time
Processor
1
432
111
1
22
2
3
3
3
…
cf. LabVIEW models
Dataflow models (2)
q Also a graph representation: Nodes & edges are called actors & channelsq Implicit repetition, common in streaming, signal processing applicationsq Communication: only through channelsq Multiple flavors of models: Rules that determine when an actor firesq A graph models multiple possible executions:
14 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Time
Processor
4
2
1
3 1
32
1122
3 3…
4 4
What now?
Dataflow models (3)
q Synchronous Dataflow (SDF): every actor has a fixed behavior
q Cyclo-Static Dataflow (CSDF): every actor has a set of fixed behaviors
15 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
a3 always writes 1 token to e4
a1: writes 1 token, then 0, then 0 to e2
a1 a2 a33 1 2 3
6 2
12
e1 e3
e4
e2
a1 a2 a31{0,2}
1
{1,0,0}
e1
e3
e21
{0,1,0}
Dynamic models and Kahn Process Networks
q Dynamic dataflow: set of firing rules per actor
q Kahn process networks (KPN): nodes are now called processes
16 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
a1 a2 a3i1
i2
p1 p2 p3e1 e3
e4
e2
p1: writes any amount of tokens to e2 at any time
Outline
q Heterogeneous systems
q Dataflow models
q Analysis and synthesis
q Summary
17 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
18
Analysis and synthesis
© J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2
Application
Architecture model
Non-functional specification
Analysis
Synthesis
Code generation
Property models (timing, energy, error, …)
PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_rightPNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_rightPNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_leftPNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_rightPNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_srctaskParams.priority = 1;
ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtr&taskParams, &eb);
glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_fft_ltaskParams.priority = 1;
ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrft_Templ, &taskParams, &eb);
glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_ifft_rtaskParams.priority = 1;
ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrfft_Templ, &taskParams, &eb);
glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_sinktaskParams.priority = 1;
cf. Silexica’s tool flows
Example for SDFs
q Compute topology matrix, and solve system of equations
q Solution: repetition vector serve to unroll the graph [1 3 2]
q Perform mapping and scheduling on the resulting directed acyclic graph (DAG)
19 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Example for KPNs: Static & dynamic analysis
q Need to understand process interactions
20 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
f1(...);write(&c2);
read(&c1);f3(...);write(&c3);f4(...);write(&c3);
read(&c2);f5(...);write(&c4);
read(&c0);
f7(...);read(&c4);
read(&c3);f6(...);read(&c3);
Unrolled CFG for
t
R2R3R4
tf1R1R2R3R4
f1 f1f2
f6 f7 f7 f7 f7
f6 f7 f7 f7 f7f5
f4f3f1R1 f1 f1 f2
f4f3
Unrolled CFG for
...
...
...
...
t
R2R3R4 f6 f7 f7 f7 f7
f5f4f3
f1R1 f1 f1 f2
...
t1
t2t3t2 t1
f5 f5
f2(...);write(&c1);
f5 f5
f5 f5 f5
f1(...);write(&c2);
read(&c1);f3(...);write(&c3);f4(...);write(&c3);
read(&c2);f5(...);write(&c4);
read(&c0);
f7(...);read(&c4);
read(&c3);f6(...);read(&c3);
Unrolled CFG for
t
R2R3R4
tf1R1R2R3R4
f1 f1f2
f6 f7 f7 f7 f7
f6 f7 f7 f7 f7f5
f4f3f1R1 f1 f1 f2
f4f3
Unrolled CFG for
...
...
...
...
t
R2R3R4 f6 f7 f7 f7 f7
f5f4f3
f1R1 f1 f1 f2
...
t1
t2t3t2 t1
f5 f5
f2(...);write(&c1);
f5 f5
f5 f5 f5
[Castrillon14]
KPN & DDF: Tracing
q Dynamic analysis based on execution traces [Castrillon10/13, Brunet15, Singh15]
21 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
DAG representation for further analysis and synthesis
Models from functional specification
q Inspect functional specification of actors/processes (cf. 2nd talk)q Instrumentation, emulation, cost models, …q For timing, energy, errors, …
q Example
22 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Profiling: Execution counts, branch stats
[SAMOS14]
Models based on algorithmic descriptions
q When functional specification is not meant for synthesisq Common/required in heterogeneous systems (special components)
q Need to match algorithms to hardware components [Castrillon10b, Castrillon11]
23 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Plain code
Plain code
Platform model + characterization of special components
... ... ...Algorithm library
N: Algorithmic actorsF: Existing implementation in target platform
Multiple-applications
q Use traces and mappings to reason about platform sharing
24 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Multiple-applications (2)
q Quickly discard “bad” multi-application configurations by observing the platform utilization profiles
25 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Outline
q Heterogeneous systems
q Dataflow models
q Analysis and synthesis
q Summary
26 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Summary
q Need programming models (and HW/SW stacks) to handle heterogeneityq Even more dramatic in the post-CMOS era
q Dataflow modelsq Natural way to describe some applicationsq Amenable to analysis and synthesis for parallel execution
q Discuss different kinds of models and required analysisq Need models of hardware for synthesis
27 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
Acknowledgements
q German Cluster of Excellence: Center for Advancing Electronics Dresden (www.cfaed.tu-dresden.de)
q Collaborative research center (CRC): Highly Adaptive Energy-Efficient Computing (HAEC)
q Silexica Software Solutions GmbH
28 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015
CRC 912: Highly Adaptive Energy-‐Efficient Computing
References
[Trommer15] Trommer, et al. "Functionality-Enhanced Logic Gate Design Enabled by Symmetrical Reconfigurable Silicon Nanowire Transistors”, IEEE Trans on in Nanotechnology, pp.689-698, July 2015
[Voigt14] Andreas Voigt, et al. “Towards Computation with Microchemomechanical Systems”, In International Journal of Foundations of Computer Science, World Scientific, volume 25, 2014
[Augonnet10] C. Augonnet, et al. “StarPU: a unified platform for task scheduling on heterogeneous multicore architectures“, Concurrency and Computation: Practice and Experience 23.2 (2011), pp. 187–198.
[Kwok99] Y.-K. Kwok, et al, “Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors”, ACM Comput. Surv., ACM Press, 1999, 31, 406 - 471
[Castrillon14] J. Castrillon, et al, “Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap”, Springer, 2014
[Castrillon10] J. Castrillon, et al, “Trace-based KPN composability analysis for mapping simultaneous applications to MPSoC platforms”, DATE’10, pp. 753-758, 2010
[Castrillon13] J. Castrillon, et al, “MAPS: Mapping concurrent dataflow applications to heterogeneous MPSoCs,” IEEE Trans on Industrial Informatics, vol. 9, no. 1, pp. 527–545, 2013
[Brunet15] Brunet, Simone C. “Analysis and optimization of dynamic dataflow programs”. Diss. ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE, 2015
[Singh15] Singh, Amit, et al. "Resource and Throughput Aware Execution Trace Analysis for Efficient Run-time Mapping on MPSoCs.”, TCAD 2015
[SAMOS14] J.F. Eusse, et al, "Pre-architectural performance estimation for ASIP design based on abstract processor models," Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014 pp.133-140, 2014
[Castrillon10b] J. Castrillon, et al, “Component-based waveform development: The nucleus tool flow for efficient and portable SDR,” Wireless Innovation Conference and Product Exposition (SDR), 2010
[Castrillon11] J. Castrillon, et al, “Component-based waveform development: The nucleus tool flow for efficient and portable software defined radio”, Analog Integrated Circuits and Signal Processing, vol. 69, no. 2–3, pp. 173–190, 2011
29 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015