Reactive Runtime Systems for Heterogeneous Extreme Computations
Thomas Sterling
Chief Scientist, CREST
Professor, School of Informatics and Computing
Indiana University
November 19, 2014
Shifting Paradigms of Computing
• Abacus
  – Counting tables
• Pascaline
• Difference engine
  – Charles Babbage
  – Per Georg Scheutz
• Tabulators
  – Herman Hollerith
• Analog computer
  – Vannevar Bush machine
• Harvard Architecture
  – Howard Aiken
  – Konrad Zuse
  – Charles Babbage Analytical Engine
The von Neumann Age
• Foundations:
  – Information Theory – Claude Shannon
  – Computability – Turing/Church
  – Cybernetics – Norbert Wiener
  – Stored-Program Computer Architecture – von Neumann
• The von Neumann shift: 1945–1960
  – Vacuum tubes, core memory
  – Technology assumptions:
    • ALUs are the most expensive components
    • Memory capacity and clock rate are the scale drivers – mainframes
    • Data movement of secondary importance
• von Neumann extended: 1960–2014
  – Semiconductors, exploitation of parallelism
  – Out-of-order completion
  – Vector
  – SIMD
  – Multiprocessor (MIMD)
    • SMP – maintain sequential consistency
    • MPP/clusters – ensemble computations with message passing
[Figure: capability-computing requirements for brain simulation. Model scales (Single Cellular Model; Cellular Neocortical Column; Cellular Mesocircuit; Cellular Rodent Brain; Cellular Human Brain) plotted against memory requirements (1 MB to 100 PB) and computational complexity (1 Gigaflops to 1 Exaflops). Subcellular detail adds cost multipliers: Glia-Cell/Vasculature O(1-10x), Plasticity O(1-10x), Learning & Memory O(10-100x), Behavior O(100-1,000x), Reaction-Diffusion O(100-1,000x), Molecular Dynamics O(>1,000,000,000x?). Reference machines: EPFL 4-rack BlueGene/L; CADMOS 4-rack BlueGene/P; BBP/CSCS 4-rack BlueGene/Q + BGAS; Jülich 28-rack BlueGene/Q; planned EU HBP Exaflop machine.]
The Negative Impact of Global Barriers in Astrophysics Codes
Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations), using 1M particles over four time steps on 128 processors. Red indicates computation; blue indicates waiting for communication.
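The cost shown in that diagram can be made concrete with a toy model: under a global barrier every step costs as much as the slowest rank, while an asynchronous execution is bounded only by the most heavily loaded rank. The work distribution below is made up for illustration, not measured from GADGET.

```python
# Toy model of the global-barrier penalty shown in the GADGET phase diagram:
# with a barrier after every step, each step costs as much as the slowest
# rank; without one, ranks only wait on their own dependences.
import random

random.seed(1)
ranks, steps = 128, 4
# Per-rank, per-step compute times with load imbalance (arbitrary units).
work = [[random.uniform(0.5, 1.5) for _ in range(ranks)] for _ in range(steps)]

# Bulk-synchronous: every step waits for the slowest rank.
barrier_time = sum(max(step) for step in work)

# Ideal asynchronous bound: each rank is limited only by its own total work,
# so the critical path is the most heavily loaded rank.
async_bound = max(sum(step[r] for step in work) for r in range(ranks))

print(f"bulk-synchronous time:    {barrier_time:.2f}")
print(f"asynchronous lower bound: {async_bound:.2f}")
print(f"idle fraction under barriers: {1 - async_bound / barrier_time:.0%}")
```

The gap between the two figures is exactly the blue "waiting" area in the phase diagram, and it grows with the imbalance and the rank count.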
[Chart: Clock Rate (MHz), 10 to 10,000, vs. time from 1/1/1992 to 1/1/2012, for Heavyweight, Lightweight, and Heterogeneous systems. Courtesy of Peter Kogge, UND.]
[Chart: Total Concurrency, TC (Flops/Cycle), 1.E+01 to 1.E+07, vs. time from 1/1/1992 to 1/1/2012, for Heavyweight, Lightweight, and Heterogeneous systems. Courtesy of Peter Kogge, UND.]
[Chart: performance gain of non-blocking programs over blocking programs with respect to core count (memory contention) and network latency. Gain (0–9x) vs. cores per node (1–32), one curve per network latency from 64 to 8192 reg-ops; 8 tasks per core, overhead of 16 reg-ops, 8% network ops.]
[Chart: performance gain of non-blocking programs over blocking programs with varying core counts (memory contention) and overheads. Gain (0–70x) vs. cores per node (1–32), one curve per overhead; latency of 8192 reg-ops, 64 tasks per core.]
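The shape of such gain curves can be reasoned about with a simple analytic sketch: blocking execution exposes the full network latency on every network operation, while non-blocking execution hides latency behind sibling tasks at the price of a per-switch overhead. The formula and the compute-per-task figure below are hypothetical stand-ins for intuition, not the model behind the charts.

```python
# Toy analytic model of the gain of non-blocking over blocking execution.
# Units are register operations (reg-ops), as in the chart captions.

def gain(latency, tasks_per_core, overhead, net_op_fraction, compute=100):
    # Blocking: every network operation stalls the core for the full latency.
    blocking = compute + net_op_fraction * compute * latency
    # Non-blocking: while one task waits, up to (tasks_per_core - 1) sibling
    # tasks run, hiding latency; each switch costs `overhead` reg-ops, and
    # any latency left unhidden is still exposed.
    exposed = max(latency - (tasks_per_core - 1) * compute, 0)
    nonblocking = compute + net_op_fraction * compute * (overhead + exposed)
    return blocking / nonblocking

# Parameters echoing the captions: overhead 16 reg-ops, 8% network ops.
for lat in (64, 1024, 8192):
    print(f"8 tasks/core, latency {lat:4d}: gain {gain(lat, 8, 16, 0.08):.1f}x")
print(f"64 tasks/core, latency 8192: gain {gain(8192, 64, 16, 0.08):.1f}x")
```

Even this crude model reproduces the qualitative message of the charts: more tasks per core hide more latency, and the gain collapses once latency exceeds what the available tasks can cover.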
The Purpose of a QUARK Runtime
• Objectives
  – High utilization of each core
  – Scaling to large numbers of cores
  – Synchronization-reducing algorithms
• Methodology
  – Dynamic DAG scheduling (QUARK)
  – Explicit parallelism
  – Implicit communication
  – Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling
Fork-join parallelism vs. DAG-scheduled parallelism: notice the synchronization penalty in the presence of heterogeneity.
Courtesy of Jack Dongarra, UTK
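The core of dynamic DAG scheduling fits in a few lines: a task becomes runnable the moment its inputs are ready, rather than waiting at a fork-join barrier. The sketch below is illustrative; the task names echo a tiny tiled factorization and are not QUARK's API.

```python
# Minimal sketch of dynamic DAG scheduling: release each task as soon as
# its dependences complete, with no global synchronization point.
from collections import deque

# DAG: task -> set of tasks it depends on.
deps = {
    "potrf":  set(),
    "trsm1":  {"potrf"},
    "trsm2":  {"potrf"},
    "gemm":   {"trsm1", "trsm2"},
    "potrf2": {"gemm"},
}

def dynamic_schedule(deps):
    """Return one valid execution order, releasing tasks as inputs complete."""
    remaining = {t: set(d) for t, d in deps.items()}
    children = {t: [] for t in deps}
    for t, d in deps.items():
        for parent in d:
            children[parent].append(t)
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)              # "execute" the task
        for child in children[t]:    # notify dependents as results arrive
            remaining[child].discard(t)
            if not remaining[child]:
                ready.append(child)  # runnable immediately, no barrier
    return order

order = dynamic_schedule(deps)
print(order)
```

Under fork-join, `trsm1` and `trsm2` would both have to finish before *any* next-phase work starts; here each dependent is released individually, which is what removes the synchronization penalty under heterogeneity.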
Asynchronous Many-Task Runtime Systems
• Dharma – Sandia National Laboratories
• Legion – Stanford University
• Charm++ – University of Illinois
• Uintah – University of Utah
• STAPL – Texas A&M University
• HPX – Indiana University
• OCR – Rice University
Semantic Components of ParalleX
Overlapping computational phases for hydrodynamics: MPI vs. HPX
Computational phases for LULESH (mini-app for hydrodynamics codes). Red indicates work; white indicates waiting for communication.
Overdecomposition: MPI used 64 processes while HPX used 1,000 threads spread across 64 cores.
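Overdecomposition is easy to demonstrate in miniature: split the problem into many more tasks than cores, so the scheduler always has ready work while other tasks wait on communication. Below, a thread pool stands in for an AMT runtime like HPX; the task body and all numbers are illustrative.

```python
# Overdecomposition sketch: 16x more tasks than workers. The in-flight
# "communication waits" (sleeps) overlap with one another, so total time is
# roughly (TASKS / CORES) * wait rather than TASKS * wait.
import time
from concurrent.futures import ThreadPoolExecutor

def task(i):
    time.sleep(0.005)           # stand-in for a communication wait
    return sum(range(1000))     # stand-in for a small compute kernel

CORES, TASKS = 4, 64            # 16x overdecomposition

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CORES) as pool:
    results = list(pool.map(task, range(TASKS)))
elapsed = time.perf_counter() - start

print(f"{TASKS} tasks on {CORES} workers: {elapsed:.3f}s "
      f"(the waits alone total {TASKS * 0.005:.3f}s serially)")
```

The MPI configuration above corresponds to one task per core: any wait idles the core. The HPX configuration corresponds to this sketch: waits are overlapped with the other tasks resident on the same core.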
HPX-5 Development Progress
Zoom-in on best performers. All cases run on 16 cores (1 locality).
Courtesy of Matt Anderson, Indiana University
[Diagram: fonton architecture – wide-word struct ALU; per-thread register sets (thread 0 … thread N-1); scratchpad; memory-interface row buffers; dataflow control state; wide instruction buffer; parcel handler; thread manager; memory vault; fault detection and handling; power management; access control; AGAS translation; PRECISE decoder; interconnect. Datapaths with associated control are distinguished from control-only interfaces. Adjacent diagram: grid of cells labeled A–P.]
Neo-Digital Age
• Goal and objectives
  – Devise means of exploiting nano-scale semiconductor technologies at the end of Moore's Law to leverage fabrication-facility investments
  – Create scalable structures and semantics capable of optimal performance (time to solution) within technology and power limitations
• Technical strategy
  1. Liberate parallel computer architecture from von Neumann (vN) archaism for efficiency and scalability; eliminate the vN bottleneck
  2. Rebalance and integrate the functional elements for data movement, operations, storage, and control to minimize time & energy
  3. Emphasize tightly coupled logic locality for low time & energy
  4. Apply dynamic adaptive localized control to address asynchrony and expose parallelism with the emergent behavior of global computing
  5. Innovate in the execution model for governing principles at all levels
Neo-Digital Age – extending past foundations
• Near nano-scale semiconductor technology
  – Flat-lining of Moore's Law at single-digit nanometers
  – Cost benefits of fab lines for economies of scale through the mass market
• Logic function modules
  – Exploit existing and new IP for efficient functional units
  – ALUs, latches/registers, nearest-neighbor data paths
• Integrated optics
  – Orders-of-magnitude bandwidth increase
  – Inter-socket
  – On-chip
• Advanced packaging and cooling
  – Dramatic improvement opportunities in volumetric utilization
  – 3-D integration of combined memory/logic/communication dies
  – Return to wafer-scale integration
Neo-Digital Age – Principles
• Elimination of vN-based parallel architecture
  – Constrained parallel control state
  – The processor-centric approach optimizes for ALU utilization with a very deep memory hierarchy; wrong answer
• ALU-pervasive structures
  – Merge ALUs with storage and communication structures
  – High availability of ALUs rather than high utilization
  – Slashes access latencies to save time and energy
• Cells of logic/memory/data-passing for single-cycle actions
  – Optimized for space/energy/performance
  – Optimized for memory-bandwidth utilization
• Emphasis on fine-grain nearest-neighbor data-movement structures
  – Direct access to adjacent state storage
  – Enables communication through nearest neighbors
• Virtual objects in a global name space
  – Data, instructions, synchronization
  – Intra-medium packet-switched wormhole routing
  – Dynamic allocation
Prior Art: Non-von Neumann Architectures
• Dataflow
  – Static (Dennis)
  – Dynamic (Arvind)
• Systolic arrays
  – Streaming (H. T. Kung)
• Neural networks – connectionist
• Processor in Memory (PIM)
  – Bit- or word-level logic on-chip at/near the sense amps of memory
  – SIMD (Iobst) or MIMD (Kogge)
• Cellular Automata (CA)
  – Global emergent behavior from local rules and state
  – von Neumann (ironically)
• Continuum Computer Architecture (CCA)
  – CA with global naming and active packets (Sterling/Brodowicz)
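The "global emergent behavior from local rules and state" idea fits in a few lines of code. This sketch uses elementary rule 110 (a classic one-dimensional CA, known to be Turing-complete) rather than von Neumann's original construction; ring size and step count are arbitrary.

```python
# Minimal elementary cellular automaton: each cell's next state depends
# only on itself and its two neighbors, yet globally complex patterns emerge.
RULE = 110  # bit k of RULE is the next state for neighborhood value k

def step(cells):
    n = len(cells)
    return [(RULE >> (cells[(i - 1) % n] << 2
                      | cells[i] << 1
                      | cells[(i + 1) % n])) & 1
            for i in range(n)]

cells = [0] * 31 + [1]  # a single live cell on a ring of 32
for _ in range(8):
    print("".join(".#"[c] for c in cells))
    cells = step(cells)
```

CCA generalizes this scheme: the cells carry richer state (data, instructions, synchronization) and exchange active packets rather than just exposing neighbor state.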
Properties
• Local communication
  – Systolic array: mostly
  – Dataflow: classically not
  – Processor in Memory: logic at the sense amps
  – Cellular Automata: fully
  – Continuum Computer Architecture: fully, with pipelined packet switching
  – Neural networks: not
• Event-driven for asynchrony management
  – Systolic array: classically not; iWarp: yes
  – Dataflow: yes
  – PIM: no
  – Cellular Automata: can be
  – CCA: yes
  – Neural networks: classically no; neuromorphic: yes
• Merged logic/storage/communication
  – Systolic array: yes
  – Dataflow: classically not, possible
  – PIM: yes
  – Cellular Automata: yes
  – CCA: yes
  – Neural networks: yes
[Diagram: dynamic dataflow processing element – tag-match logic, control, ALU, register file with tag/value pairs, network interface.]
[Diagram: CCA die and CCA system – 3-D die stacks of fontons with TSVs and on-chip network, connected by optical fiber bundles through integrated optical transceivers.]
Performance
• Maximum ALUs – for high utilization
• Maximum memory bandwidth
• Lots of inter-cell communication bandwidth
• Reduced overhead
• Reduced latency
• Adaptive routing for contention avoidance
• Multi-variate storage beyond the bit
• Multi-variate logic beyond base-2 Boolean
• Dynamic adaptive execution model
Energy
• Minimize distance between elements
  – Permeate structures with ALUs
  – Short latencies between memory & registers
• Multiple clock rates to match timings between logic and storage
• Neighbor-cell memory/register access
• Eliminate large caches and multi-level caches
• Eliminate speculative execution
Questions That Can Be Answered Now
1. Trade-offs of DRAM bits and SRAM bits – space, time, energy
2. Cell rules – derived from the parallel execution model
3. Cell granularity – the breakpoint where mitosis is better
4. Design of the logic cell (fonton) – very simple compared to full conventional processors
5. Reference implementation – emulation, simulation, FPGA, ASIC, custom
6. 3-D packaging
7. Software environment
8. Application analysis
Conclusions
• New high-density functional structures are required at the end of Moore's Law, and are emerging
• Reactive runtime systems, supported by innovations in hardware architecture mechanisms, will exploit extremes of parallelism at high efficiency
• The Neo-Digital Age advances beyond von Neumann architectures to maximize execution concurrency and react to the uncertainties of asynchrony