Reactive Runtime Systems for Heterogeneous Extreme Computations
Thomas Sterling
Chief Scientist, CREST
Professor, School of Informatics and Computing
Indiana University
November 19, 2014
Shifting Paradigms of Computing
• Abacus
  – Counting tables
• Pascaline
• Difference engine
  – Charles Babbage
  – Per Georg Scheutz
• Tabulators
  – Herman Hollerith
• Analog computer
  – Vannevar Bush machine
• Harvard Architecture
  – Howard Aiken
  – Konrad Zuse
  – Charles Babbage Analytical Engine
The von Neumann Age
• Foundations:
  – Information Theory – Claude Shannon
  – Computability – Turing/Church
  – Cybernetics – Norbert Wiener
  – Stored-Program Computer Architecture – von Neumann
• The von Neumann shift: 1945–1960
  – Vacuum tubes, core memory
  – Technology assumptions:
    • ALUs are the most expensive components
    • Memory capacity and clock rate are the scale drivers – mainframes
    • Data movement of secondary importance
• von Neumann extended: 1960–2014
  – Semiconductors, exploitation of parallelism
  – Out-of-order completion
  – Vector
  – SIMD
  – Multiprocessor (MIMD)
    • SMP – maintain sequential consistency
    • MPP/clusters – ensemble computations with message passing
[Figure: capability-computing requirements for brain simulation. Model scales (Single Cellular Model; Cellular Neocortical Column; Cellular Mesocircuit; Cellular Rodent Brain; Cellular Human Brain) plotted against memory requirements (1 MB to 100 PB) and computational complexity (1 Gigaflops to 1 Exaflops). Subcellular detail adds cost multipliers: Glia-Cell/Vasculature O(1-10x), Plasticity O(1-10x), Learning & Memory O(10-100x), Behavior O(100-1,000x), Reaction-Diffusion O(100-1,000x), Molecular Dynamics O(>1,000,000,000x?). Reference machines: EPFL 4-rack BlueGene/L; CADMOS 4-rack BlueGene/P; BBP/CSCS 4-rack BlueGene/Q + BGAS; Jülich 28-rack BlueGene/Q; planned EU HBP Exaflop machine.]
The Negative Impact of Global Barriers in Astrophysics Codes
Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations), using 1M particles over four time steps on 128 processors. Red indicates computation; blue indicates waiting for communication.
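The cost shown in that diagram can be made concrete with a toy model: under a global barrier every step costs as much as the slowest rank, while an asynchronous execution is bounded only by the most heavily loaded rank. The work distribution below is made up for illustration, not measured from GADGET.

```python
# Toy model of the global-barrier penalty shown in the GADGET phase diagram:
# with a barrier after every step, each step costs as much as the slowest
# rank; without one, ranks only wait on their own dependences.
import random

random.seed(1)
ranks, steps = 128, 4
# Per-rank, per-step compute times with load imbalance (arbitrary units).
work = [[random.uniform(0.5, 1.5) for _ in range(ranks)] for _ in range(steps)]

# Bulk-synchronous: every step waits for the slowest rank.
barrier_time = sum(max(step) for step in work)

# Ideal asynchronous bound: each rank is limited only by its own total work,
# so the critical path is the most heavily loaded rank.
async_bound = max(sum(step[r] for step in work) for r in range(ranks))

print(f"bulk-synchronous time:    {barrier_time:.2f}")
print(f"asynchronous lower bound: {async_bound:.2f}")
print(f"idle fraction under barriers: {1 - async_bound / barrier_time:.0%}")
```

The gap between the two figures is exactly the blue "waiting" area in the phase diagram, and it grows with the imbalance and the rank count.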
[Chart: Clock Rate (MHz), 10 to 10,000, vs. time from 1/1/1992 to 1/1/2012, for Heavyweight, Lightweight, and Heterogeneous systems. Courtesy of Peter Kogge, UND.]
[Chart: Total Concurrency, TC (Flops/Cycle), 1.E+01 to 1.E+07, vs. time from 1/1/1992 to 1/1/2012, for Heavyweight, Lightweight, and Heterogeneous systems. Courtesy of Peter Kogge, UND.]
[Chart: performance gain of non-blocking programs over blocking programs with respect to core count (memory contention) and network latency. Gain (0–9x) vs. cores per node (1–32), one curve per network latency from 64 to 8192 reg-ops; 8 tasks per core, overhead of 16 reg-ops, 8% network ops.]
[Chart: performance gain of non-blocking programs over blocking programs with varying core counts (memory contention) and overheads. Gain (0–70x) vs. cores per node (1–32), one curve per overhead; latency of 8192 reg-ops, 64 tasks per core.]
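The shape of such gain curves can be reasoned about with a simple analytic sketch: blocking execution exposes the full network latency on every network operation, while non-blocking execution hides latency behind sibling tasks at the price of a per-switch overhead. The formula and the compute-per-task figure below are hypothetical stand-ins for intuition, not the model behind the charts.

```python
# Toy analytic model of the gain of non-blocking over blocking execution.
# Units are register operations (reg-ops), as in the chart captions.

def gain(latency, tasks_per_core, overhead, net_op_fraction, compute=100):
    # Blocking: every network operation stalls the core for the full latency.
    blocking = compute + net_op_fraction * compute * latency
    # Non-blocking: while one task waits, up to (tasks_per_core - 1) sibling
    # tasks run, hiding latency; each switch costs `overhead` reg-ops, and
    # any latency left unhidden is still exposed.
    exposed = max(latency - (tasks_per_core - 1) * compute, 0)
    nonblocking = compute + net_op_fraction * compute * (overhead + exposed)
    return blocking / nonblocking

# Parameters echoing the captions: overhead 16 reg-ops, 8% network ops.
for lat in (64, 1024, 8192):
    print(f"8 tasks/core, latency {lat:4d}: gain {gain(lat, 8, 16, 0.08):.1f}x")
print(f"64 tasks/core, latency 8192: gain {gain(8192, 64, 16, 0.08):.1f}x")
```

Even this crude model reproduces the qualitative message of the charts: more tasks per core hide more latency, and the gain collapses once latency exceeds what the available tasks can cover.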
The Purpose of a QUARK Runtime
• Objectives
  – High utilization of each core
  – Scaling to large numbers of cores
  – Synchronization-reducing algorithms
• Methodology
  – Dynamic DAG scheduling (QUARK)
  – Explicit parallelism
  – Implicit communication
  – Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling
Fork-join parallelism vs. DAG-scheduled parallelism: notice the synchronization penalty in the presence of heterogeneity.
Courtesy of Jack Dongarra, UTK
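The core of dynamic DAG scheduling fits in a few lines: a task becomes runnable the moment its inputs are ready, rather than waiting at a fork-join barrier. The sketch below is illustrative; the task names echo a tiny tiled factorization and are not QUARK's API.

```python
# Minimal sketch of dynamic DAG scheduling: release each task as soon as
# its dependences complete, with no global synchronization point.
from collections import deque

# DAG: task -> set of tasks it depends on.
deps = {
    "potrf":  set(),
    "trsm1":  {"potrf"},
    "trsm2":  {"potrf"},
    "gemm":   {"trsm1", "trsm2"},
    "potrf2": {"gemm"},
}

def dynamic_schedule(deps):
    """Return one valid execution order, releasing tasks as inputs complete."""
    remaining = {t: set(d) for t, d in deps.items()}
    children = {t: [] for t in deps}
    for t, d in deps.items():
        for parent in d:
            children[parent].append(t)
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)              # "execute" the task
        for child in children[t]:    # notify dependents as results arrive
            remaining[child].discard(t)
            if not remaining[child]:
                ready.append(child)  # runnable immediately, no barrier
    return order

order = dynamic_schedule(deps)
print(order)
```

Under fork-join, `trsm1` and `trsm2` would both have to finish before *any* next-phase work starts; here each dependent is released individually, which is what removes the synchronization penalty under heterogeneity.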
Asynchronous Many-Task Runtime Systems
• Dharma – Sandia National Laboratories
• Legion – Stanford University
• Charm++ – University of Illinois
• Uintah – University of Utah
• STAPL – Texas A&M University
• HPX – Indiana University
• OCR – Rice University
Semantic Components of ParalleX
Overlapping computational phases for hydrodynamics: MPI vs. HPX
Computational phases for LULESH (mini-app for hydrodynamics codes). Red indicates work; white indicates waiting for communication.
Overdecomposition: MPI used 64 processes while HPX used 1,000 threads spread across 64 cores.
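Overdecomposition is easy to demonstrate in miniature: split the problem into many more tasks than cores, so the scheduler always has ready work while other tasks wait on communication. Below, a thread pool stands in for an AMT runtime like HPX; the task body and all numbers are illustrative.

```python
# Overdecomposition sketch: 16x more tasks than workers. The in-flight
# "communication waits" (sleeps) overlap with one another, so total time is
# roughly (TASKS / CORES) * wait rather than TASKS * wait.
import time
from concurrent.futures import ThreadPoolExecutor

def task(i):
    time.sleep(0.005)           # stand-in for a communication wait
    return sum(range(1000))     # stand-in for a small compute kernel

CORES, TASKS = 4, 64            # 16x overdecomposition

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CORES) as pool:
    results = list(pool.map(task, range(TASKS)))
elapsed = time.perf_counter() - start

print(f"{TASKS} tasks on {CORES} workers: {elapsed:.3f}s "
      f"(the waits alone total {TASKS * 0.005:.3f}s serially)")
```

The MPI configuration above corresponds to one task per core: any wait idles the core. The HPX configuration corresponds to this sketch: waits are overlapped with the other tasks resident on the same core.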
HPX-5 Development Progress
Zoom-in on best performers. All cases run on 16 cores (1 locality).
Courtesy of Matt Anderson, Indiana University
[Diagram: fonton architecture – wide-word struct ALU; per-thread register sets (thread 0 … thread N-1); scratchpad; memory-interface row buffers; dataflow control state; wide instruction buffer; parcel handler; thread manager; memory vault; fault detection and handling; power management; access control; AGAS translation; PRECISE decoder; interconnect. Datapaths with associated control are distinguished from control-only interfaces. Adjacent diagram: grid of cells labeled A–P.]
Neo-Digital Age
• Goal and objectives
  – Devise means of exploiting nano-scale semiconductor technologies at the end of Moore's Law to leverage fabrication-facility investments
  – Create scalable structures and semantics capable of optimal performance (time to solution) within technology and power limitations
• Technical strategy
  1. Liberate parallel computer architecture from von Neumann (vN) archaism for efficiency and scalability; eliminate the vN bottleneck
  2. Rebalance and integrate the functional elements for data movement, operations, storage, and control to minimize time & energy
  3. Emphasize tightly coupled logic locality for low time & energy
  4. Apply dynamic adaptive localized control to address asynchrony and expose parallelism with the emergent behavior of global computing
  5. Innovate in the execution model for governing principles at all levels
Neo-Digital Age – extending past foundations
• Near nano-scale semiconductor technology
  – Flat-lining of Moore's Law at single-digit nanometers
  – Cost benefits of fab lines for economies of scale through the mass market
• Logic function modules
  – Exploit existing and new IP for efficient functional units
  – ALUs, latches/registers, nearest-neighbor data paths
• Integrated optics
  – Orders-of-magnitude bandwidth increase
  – Inter-socket
  – On-chip
• Advanced packaging and cooling
  – Dramatic improvement opportunities in volumetric utilization
  – 3-D integration of combined memory/logic/communication dies
  – Return to wafer-scale integration
Neo-Digital Age – Principles
• Elimination of vN-based parallel architecture
  – Constrained parallel control state
  – The processor-centric approach optimizes for ALU utilization with a very deep memory hierarchy; wrong answer
• ALU-pervasive structures
  – Merge ALUs with storage and communication structures
  – High availability of ALUs rather than high utilization
  – Slashes access latencies to save time and energy
• Cells of logic/memory/data-passing for single-cycle actions
  – Optimized for space/energy/performance
  – Optimized for memory-bandwidth utilization
• Emphasis on fine-grain nearest-neighbor data-movement structures
  – Direct access to adjacent state storage
  – Enables communication through nearest neighbors
• Virtual objects in a global name space
  – Data, instructions, synchronization
  – Intra-medium packet-switched wormhole routing
  – Dynamic allocation
Prior Art: Non-von Neumann Architectures
• Dataflow
  – Static (Dennis)
  – Dynamic (Arvind)
• Systolic arrays
  – Streaming (H. T. Kung)
• Neural networks – connectionist
• Processor in Memory (PIM)
  – Bit- or word-level logic on-chip at/near the sense amps of memory
  – SIMD (Iobst) or MIMD (Kogge)
• Cellular Automata (CA)
  – Global emergent behavior from local rules and state
  – von Neumann (ironically)
• Continuum Computer Architecture (CCA)
  – CA with global naming and active packets (Sterling/Brodowicz)
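The "global emergent behavior from local rules and state" idea fits in a few lines of code. This sketch uses elementary rule 110 (a classic one-dimensional CA, known to be Turing-complete) rather than von Neumann's original construction; ring size and step count are arbitrary.

```python
# Minimal elementary cellular automaton: each cell's next state depends
# only on itself and its two neighbors, yet globally complex patterns emerge.
RULE = 110  # bit k of RULE is the next state for neighborhood value k

def step(cells):
    n = len(cells)
    return [(RULE >> (cells[(i - 1) % n] << 2
                      | cells[i] << 1
                      | cells[(i + 1) % n])) & 1
            for i in range(n)]

cells = [0] * 31 + [1]  # a single live cell on a ring of 32
for _ in range(8):
    print("".join(".#"[c] for c in cells))
    cells = step(cells)
```

CCA generalizes this scheme: the cells carry richer state (data, instructions, synchronization) and exchange active packets rather than just exposing neighbor state.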
Properties
• Local communication
  – Systolic array: mostly
  – Dataflow: classically not
  – Processor in Memory: logic at the sense amps
  – Cellular Automata: fully
  – Continuum Computer Architecture: fully, with pipelined packet switching
  – Neural networks: not
• Event-driven for asynchrony management
  – Systolic array: classically not; iWarp: yes
  – Dataflow: yes
  – PIM: no
  – Cellular Automata: can be
  – CCA: yes
  – Neural networks: classically no; neuromorphic: yes
• Merged logic/storage/communication
  – Systolic array: yes
  – Dataflow: classically not, possible
  – PIM: yes
  – Cellular Automata: yes
  – CCA: yes
  – Neural networks: yes
[Diagram: dynamic dataflow processing element – tag-match logic, control, ALU, register file with tag/value pairs, network interface.]
[Diagram: CCA die and CCA system – 3-D die stacks of fontons with TSVs and on-chip network, connected by optical fiber bundles through integrated optical transceivers.]
Performance
• Maximum ALUs – for high utilization
• Maximum memory bandwidth
• Lots of inter-cell communication bandwidth
• Reduced overhead
• Reduced latency
• Adaptive routing for contention avoidance
• Multi-variate storage beyond the bit
• Multi-variate logic beyond base-2 Boolean
• Dynamic adaptive execution model
Energy
• Minimize distance between elements
  – Permeate structures with ALUs
  – Short latencies between memory & registers
• Multiple clock rates to match timings between logic and storage
• Neighbor-cell memory/register access
• Eliminate large caches and multi-level caches
• Eliminate speculative execution
Questions That Can Be Answered Now
1. Trade-offs of DRAM bits and SRAM bits – space, time, energy
2. Cell rules – derived from the parallel execution model
3. Cell granularity – the breakpoint where mitosis is better
4. Design of the logic cell (fonton) – very simple compared to full conventional processors
5. Reference implementation – emulation, simulation, FPGA, ASIC, custom
6. 3-D packaging
7. Software environment
8. Application analysis
Conclusions
• New high-density functional structures are required at the end of Moore's Law, and are emerging
• Reactive runtime systems, supported by innovations in hardware architecture mechanisms, will exploit extremes of parallelism at high efficiency
• The Neo-Digital Age advances beyond von Neumann architectures to maximize execution concurrency and react to the uncertainties of asynchrony