
Page 1

High Performance Computing: Introduction, overview

Jesper Larsson Träff, traff@par. …

Parallel Computing, 184-5, Favoritenstrasse 16, 3rd floor

Office hours: by email appointment

Page 2

High Performance Computing: A (biased) overview

Concerns: Either

1. Achieving highest possible performance as needed by some application(s)

2. Getting highest possible performance out of given (highly parallel) system

• Ad 1: Anything goes, including designing and building new systems, raw (application) performance matters

• Ad 2: Understanding and exploiting details at all levels of given system

Page 3

• Understanding modern processors: Processor architecture, memory system, single-core performance, multi-core parallelism

• Understanding parallel computers: Communication networks

• Programming parallel systems efficiently and effectively: Algorithms, interfaces, tools, tricks

All issues at all levels are relevant

…but not always to the same extent and at the same time

Ad. 2

Page 4

Typical “Scientific Computing” applications

• Climate (simulations: coupled models, multi-scale, multi-physics)

• Earth science
• Long-term weather forecasting

• Nuclear physics
• Computational chemistry
• Computational astronomy
• Computational fluid dynamics

• Protein folding, Molecular Dynamics (MD)

• Cryptography (code-breaking)
• Weapons (design, nuclear stockpile), defense (“National Security”), spying (NSA), …

Qualified estimates say these problems require TeraFLOPS, PetaFLOPS, ExaFLOPS, …

Page 5

Other, newer “High-Performance Computing” applications

Data analytics (Google, Amazon, FB, …), “big data”

Irregular data (graphs), irregular access patterns (graph algorithms)

Applications have different characteristics (operations, loops, tasks, access patterns, locality) and requirements (computation, memory, communication).
Different HPC architecture trade-offs suit different applications.

Page 6

Ad. 1: Special purpose HPC systems for Molecular Dynamics

Special purpose computers have a history in HPC

“Colossus” replica, Tony Sale 2006

N-body computations of forces between molecules to determine movements: Special type of computation with specialized algorithms that could potentially be executed orders of magnitude more efficiently on special-purpose hardware

Page 7

MDGRAPE-3: PetaFLOPS performance in 2006, more than 3 times faster than BlueGene/L (Top500 #1 at that time)

MDGRAPE-4: Last in the series of a Japanese project of MD supercomputers (RIKEN)

Page 8

MDGRAPE-4: Last in the series of a Japanese project of MD supercomputers (RIKEN)

Ohmura I, Morimoto G, Ohno Y, Hasegawa A, Taiji M. MDGRAPE-4: A special-purpose computer system for molecular dynamics simulations. Phil. Trans. R. Soc. A 372: 20130387, 2014. http://dx.doi.org/10.1098/rsta.2013.0387

Page 9

Anton (van Leeuwenhoek): Another special purpose MD system

512-node (8x8x8 torus) Anton machine

D. E. Shaw Research (DESRES)

Special purpose Anton chip (ASIC)

Page 10

From “Encyclopedia on Parallel Computing”, Springer 2011:

“Prior to Anton’s completion, few reported all-atom protein simulations had reached 2μs, the longest being a 10-μs simulation that took over 3 months on the NCSA Abe supercomputer […]. On June 1, 2009, Anton completed the first millisecond-long simulation – more than 100 times longer than any reported previously.”

Page 11

Recent Anton 2 installation:

Pittsburgh Supercomputing Center (PSC), see
• https://www.psc.edu/resources/computing/anton
• https://www.psc.edu/news-publications/2181-anton-2-will-increase-speed-size-of-molecular-simulations

Page 12

Brian Towles, J. P. Grossman, Brian Greskamp, David E. Shaw: Unifying on-chip and inter-node switching within the Anton 2 network. ISCA 2014: 1-12

David E. Shaw, Martin M. Deneroff, Ron O. Dror, Jeffrey Kuskin, Richard H. Larson, John K. Salmon, Cliff Young, Brannon Batson, Kevin J. Bowers, Jack C. Chao, Michael P. Eastwood, Joseph Gagliardo, J. P. Grossman, Richard C. Ho, Doug Ierardi, István Kolossváry, John L. Klepeis, Timothy Layman, Christine McLeavey, Mark A. Moraes, Rolf Mueller, Edward C. Priest, Yibing Shan, Jochen Spengler, Michael Theobald, Brian Towles, Stanley C. Wang: Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM 51(7): 91-97 (2008)

Ron O. Dror, Cliff Young, David E. Shaw: Anton, A Special-Purpose Molecular Simulation Machine. Encyclopedia of Parallel Computing 2011: 60-71

Page 13

Ad 1: Special purpose to general purpose

Special-purpose designs sometimes have wider applicability

Special-purpose advantages:
• Higher performance (FLOPS) for special types of computations/applications
• More efficient (energy, number of transistors, …)

• Graphics processing units (GPUs) for general purpose computing (GPGPU)

• Field Programmable Gate Arrays (FPGA)

HPC systems: Special purpose processors as accelerators

Page 14

General purpose MD packages

• GROMACS, www.gromacs.org
• NAMD, www.ks.uiuc.edu/Research/namd/

Page 15

Other typical components in scientific computing applications

• Dense and sparse matrices, linear equations
• PDE (“Partial Differential Equations”), multi-grid methods
• N-body problems (MD again)
• …

• Many (parallel) support libraries:
  • BLAS -> LAPACK -> ScaLAPACK
  • Intel’s MKL (Math Kernel Library)
  • MAGMA/PLASMA
  • FLAME/Elemental/PLAPACK [R. van de Geijn]
  • PETSc (“Portable Extensible Toolkit for Scientific computation”)

M. Snir: “A Note on N-Body Computations with Cutoffs”. Theory Comput. Syst. 37(2): 295-318, 2004

Page 16

Ad. 2: Template High-Performance Computing architecture

Georg Hager, Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. Chapman and Hall / CRC computational science series, CRC Press 2011, ISBN 978-1-439-81192-4, pp. I-XXV, 1-330

• Typical elements of modern, parallel (High-Performance Computing) architectures: “A qualitative approach”

• Balance: Which architecture for which applications?

• Levels of parallelism

• Parallelism in programming model/interface

Page 17

[Figure: template parallel architecture: two nodes, each with several cores (L1 … Lk cache hierarchy, SIMD units, accelerator), node-local main memory, and a NIC, connected by a communication network]

• Hierarchical designs: core, processor, node, rack, island, …
• Orthogonal capabilities: accelerators, vector units
• Different types of parallelism at all levels

Page 18

[Figure: the same template architecture as on the previous slide]

• Total number of cores (what counts as a core?)
• Size of memories
• Properties of the communication network

Page 19

[Figure: single core with memory hierarchy (L1 … Lk), main memory, SIMD unit, accelerator]

• Compute performance: how many instructions can each core perform per clock cycle (superscalar ≥ 1)?
• Special instructions: vector, SIMD
• Accelerator (if integrated in the core)

Parallelism in the core:
• Implicit, hidden (ILP)
• Explicit SIMD
• Explicit accelerator (GPU)
How is it expressed and exploited?

Page 20

Compute performance measured in FLOPS: Floating Point Operations per Second

Floating Point: In HPC almost always 64-bit IEEE Floating Point number (32 bits too little for many scientific applications)

FLOPS prefixes:
M(ega)FLOPS:  10^6
G(iga)FLOPS:  10^9
T(era)FLOPS:  10^12
P(eta)FLOPS:  10^15
E(xa)FLOPS:   10^18
Z(etta)FLOPS: 10^21
Y(otta)FLOPS: 10^24

System peak floating-point performance (Rpeak)

Definition (HW peak performance):
Rpeak ≈ ClockFrequency x #FLOP/cycle x #CPUs x #cores/CPU

Optimistic, best case upper bound
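To make the definition concrete, here is a minimal sketch (not from the slides) that plugs purely hypothetical per-node numbers into the Rpeak formula; the clock rate, FLOP/cycle, CPU and core counts are illustrative assumptions, not the data of any particular machine.

#include <stdio.h>

/* Hedged sketch: evaluating the Rpeak formula above with assumed numbers. */
int main(void)
{
    double clock_hz       = 2.5e9; /* assumed clock frequency */
    double flop_per_cycle = 16.0;  /* assumed per-core FLOP/cycle (SIMD + FMA) */
    double cpus           = 2.0;   /* assumed sockets per node */
    double cores_per_cpu  = 24.0;  /* assumed cores per socket */

    double rpeak = clock_hz * flop_per_cycle * cpus * cores_per_cpu;
    printf("Rpeak = %.2f TFLOPS per node\n", rpeak / 1e12); /* 1.92 TFLOPS */
    return 0;
}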

Page 21

[Figure: core with memory hierarchy (L1 … Lk), main memory, and SIMD unit]

• Compute performance: how many instructions can a core perform per clock cycle (superscalar ≥ 1)?
• Special instructions: vector, SIMD (v ≥ 1 operations per cycle)

Vector processor: performance comes from a wide SIMD unit
High performance for applications with large vectors

Superscalar: multiple pipelines (integer, logical, FP add, FP mul, …)
Requires the right mix of instructions

Page 22

Parallelism through:
• Pipelining: even complex instructions can be delivered once per cycle. Problem: dependencies, branches
• Multiple pipelines: several different, independent instructions can be executed concurrently

Superscalar: multiple pipelines (integer, logical, FP add, FP mul, …)

Page 23

[Figure: core with memory hierarchy, main memory, SIMD unit, and an accelerator with its own memory]

• Compute performance: how many instructions can a core perform per clock cycle (superscalar ≥ 1)?
• Special instructions: vector, SIMD
• Accelerator: in the core or external (e.g., GPU)

Heavily accelerated system: one or more accelerators
How tightly are they integrated with the memory system/core?
High performance for applications that fit the accelerator model

Page 24

[Figure: core with memory hierarchy (L1 … Lk), main memory, SIMD unit, accelerator]

• Memory hierarchy: latency (number of cycles to access the first Byte), bandwidth (Bytes/second)
• Balance between compute performance and memory bandwidth
• Memory access times not uniform (NUMA)

Page 25

Definition (HW Peak Performance):

Rpeak ≈ ClockFrequency x #FLOP/Cycle x #CPU’s x #Cores/CPU

Definition: The hardware efficiency is the ratio Rmax/Rpeak, with Rmax the measured (sustained) application performance and Rpeak the nominal HW peak performance.

Measured application performance (sustained performance): how many FLOPS does the application achieve on the system?

Note: this efficiency measure is totally different from the algorithmic (parallel) efficiency E = SU/p (speedup divided by number of processors).

What if efficiency ≪ 1?

Page 26

[Figure: core with memory hierarchy, main memory, SIMD unit, accelerator]

An application is:
• Compute bound, if the memory bandwidth needed to sustain its FLOP rate is smaller than the available memory bandwidth (many FLOPs per Byte read/written)
• Memory bound, if the memory bandwidth needed to sustain its FLOP rate is larger than the available memory bandwidth (few FLOPs per Byte read/written)

Page 27

Given an application (kernel) A:

Arithmetic (operational) intensity OI: the (average) number of (floating point) operations per Byte read/written (a property of the application)

Required bandwidth RB: performance in (FL)OPS divided by OI

Memory bound:  RB > MB
Compute bound: RB < MB

(MB: memory bandwidth, a property of the processor and memory system)

Example: RB on a 2 GHz, non-superscalar processor, with 64-bit floats:

a = x*x+2*x*x*x+3*x*x*x*x+4*x*x*x*x*x;

OI = 16/(2*8) = 1 FLOP/Byte, so RB = 2 GByte/s. Can the memory system deliver?
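A second, hedged illustration (not from the slides), counted the same way as the polynomial example above: the operational intensity of a simple triad loop.

/* Sketch: OI of a triad loop. Per iteration: 2 FLOPs (one multiply, one add)
 * and 3 doubles moved (read b[i], read c[i], write a[i]) = 24 Bytes,
 * ignoring caches. OI = 2/24 ≈ 0.083 FLOP/Byte. On the hypothetical 2 GHz,
 * 1 FLOP/cycle processor above, sustaining 2 GFLOPS would require
 * RB ≈ 24 GByte/s, so this loop is almost certainly memory bound. */
void triad(int n, double *a, const double *b, const double *c, double s)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}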

More in Roofline lecture

Page 28

[Figure: multi-core CPU: several cores with private L1 caches, a shared Lk cache, SIMD units, accelerator, and node-local main memory]

Multi-core CPU

• Cache hierarchy with 2, 3, 4, … levels: how to exploit it efficiently (capacity, associativity, …)?
• Caches shared at certain levels (different in different processors, e.g., AMD, Intel, …)
• Caches coherent?
• Memory typically (very) NUMA

Cache management is most often transparent (done by the CPU); it can have a huge performance impact.
Applications do not benefit equally well from the cache system.

Shared-memory parallelism (OpenMP, threads, MPI, …)

Page 29

[Figure: node with multi-core CPU, memory hierarchy, main memory, and a NIC attached to the communication network]

Properties of the communication network:
• Latency (time to initiate communication, first Byte), bandwidth (Bytes/second) or time per unit
• Contention?

• How is communication network integrated with memory and processor?

• What can communication coprocessor (NIC) do?

• Possible to overlap communication and computation?
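As a hedged illustration of the overlap question (a sketch, not from the slides): with nonblocking MPI calls, independent computation can proceed while the NIC moves data; whether real overlap happens depends on the MPI library and on what the NIC can do on its own.

#include <mpi.h>

/* Sketch: overlap communication and computation with nonblocking MPI.
 * compute_inner() stands for work that does not touch recvbuf. */
void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int partner, MPI_Comm comm,
                          void (*compute_inner)(void))
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, comm, &reqs[1]);

    compute_inner();                            /* overlapped computation */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* recvbuf valid from here */
}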

Page 30

[Figure: node with multi-core CPU, memory hierarchy, main memory, NIC, and communication network]

An application is:
• Communication bound, if the network bandwidth needed to sustain its FLOP rate (given its FLOPs per Byte communicated, OI) is larger than the available communication bandwidth

A large number of cores with high compute performance (plus accelerators) share the network bandwidth

Network parallelism:
• Explicit (MPI-like) or implicit?
• Between cores, between nodes?

Page 31

Roofline models (more in the Roofline lecture):

Samuel Williams, Andrew Waterman, David A. Patterson: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65-76 (2009)

Nicolas Denoyelle, Brice Goglin, Aleksandar Ilic, Emmanuel Jeannot, Leonel Sousa: Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model. IEEE Trans. Parallel Distrib. Syst. 30(6): 1374-1389 (2019)

Aleksandar Ilic, Frederico Pratas, Leonel Sousa: Beyond the Roofline: Cache-Aware Power and Energy-Efficiency Modeling for Multi-Cores. IEEE Trans. Computers 66(1): 52-58 (2017)

David Cardwell, Fengguang Song: An Extended Roofline Model with Communication-Awareness for Distributed-Memory HPC Systems. HPC Asia 2019: 26-35

Page 32

Some past and present HPC architectures

Looking at Top500 list: www.top500.org

Ranks supercomputer performance by LINPACK benchmark (HPL), updated twice yearly (June, ISC Germany; November ACM/IEEE Supercomputing)

Page 33

Serious background of the Top500: benchmarking to evaluate (super)computer performance

In HPC this is often based on one single benchmark: High Performance LINPACK (HPL) solves a system of linear equations under specified constraints (minimum number of operations), see www.top500.org

HPL performs well (high computational efficiency, high AI) on many architectures; allows a wide range of optimizations

HPL is less demanding on communication performance: Compute bound, O(n) FLOPs per Byte (OI)

HPL does not give a balanced view of “overall” system capabilities (communication)

HPL is politically important… (much money lost because of HPL…)

Page 34

LINPACK performance as reported in Top500

• Rmax: FLOPS measured by solving a large LINPACK instance
• Nmax: problem size for reaching Rmax
• N1/2: problem size for reaching Rmax/2
• Rpeak: system peak performance, as computed by the owner

The number of double-precision floating-point operations needed for solving the linear system must be (at least) 2/3·n^3 + O(n^2)

Excludes:
• Strassen and other asymptotically fast matrix-matrix multiplication methods
• Algorithms that compute with less than 64-bit precision

Page 35

[Chart: Top500 performance development, June 2019: curves for the #1 system, the #500 system, and the sum over all systems]

What are the systems at the jumps?

Page 36

June 2019: Rank #1

System: Summit (IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, dual-rail Mellanox EDR InfiniBand), IBM; DOE/SC/Oak Ridge National Laboratory, United States
Cores: 2,414,592
Rmax: 148,600.0 TFLOPS
Rpeak: 200,794.9 TFLOPS
Power: 10,096 kW
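A small side calculation (a sketch using only the numbers from the table above): Summit's HPL efficiency Rmax/Rpeak and its energy efficiency Rmax/Power.

#include <stdio.h>

/* Sketch: figures derived from the Top500 entry above. */
int main(void)
{
    double rmax_tflops = 148600.0, rpeak_tflops = 200794.9, power_kw = 10096.0;
    printf("HPL efficiency Rmax/Rpeak = %.2f\n", rmax_tflops / rpeak_tflops); /* ~0.74 */
    printf("Energy efficiency = %.1f GFLOPS/W\n",
           (rmax_tflops * 1000.0) / (power_kw * 1000.0));                     /* ~14.7 */
    return 0;
}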

Page 37

June 2019: Rank #2

System: Sierra (IBM Power System S922LC, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, dual-rail Mellanox EDR InfiniBand), IBM / NVIDIA / Mellanox; DOE/NNSA/LLNL, United States
Cores: 1,572,480
Rmax: 94,640.0 TFLOPS
Rpeak: 125,712.0 TFLOPS
Power: 7,438 kW

Page 38

November 2017: Rank #1

System: Sunway TaihuLight (Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway), NRCPC; National Supercomputing Center in Wuxi, China
Cores: 10,649,600
Rmax: 93,014.6 TFLOPS
Rpeak: 125,435.9 TFLOPS
Power: 15,371 kW

Page 39

HPL is politically important… (much money lost because of HPL…)

HPL is used to make projections of supercomputing performance trends (in the spirit of Moore’s “Law”)

HPL is a co-driver of supercomputing “performance” development: it is hard (for a compute center, for a politician, …) to defend building a system that will not rank highly on the Top500

Strong (political) drive towards Exascale:

PetaFLOPS was achieved in 2008, ExaFLOPS expected ca. 2018-2020, by simple extrapolation from Top500

Page 40

[Chart: Top500 performance projection, November 2016]

According to the projection, the 2018/19 ExaFLOPS prediction will not hold.
Why not? Are there specific obstacles to ExaScale performance?

Page 41

[Chart: Top500 performance projection, November 2017]

According to the projection, the 2018/19 ExaFLOPS prediction will not hold.

Page 42

[Chart: Top500 performance projection, June 2019]

According to the projection, the 2018/19 ExaFLOPS prediction will not hold.

Page 43

Other HPC system benchmarks

Intended to complement HPL or to highlight other aspects:

• HPCC: www.hpcchallenge.org, benchmark suite (DGEMM, STREAM, PTRANS, Random Access, FFT, B_Eff)
• HPCG: http://hpcg-benchmark.org
• HPGMG: https://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg
• Graph500 (graph search, BFS): www.graph500.org
• Green500 (energy consumption/efficiency): www.green500.org
• STREAM: www.cs.virginia.edu/stream, memory performance

Page 44

NAS Parallel Benchmarks (NPB): https://www.nas.nasa.gov/publications/npb.html: Benchmark suite of small kernels

• IS: Integer sort
• EP: Embarrassingly parallel
• CG: Conjugate Gradient
• MG: Multigrid
• FT: Discrete 3D Fast Fourier Transform
• BT: Block tridiagonal solver
• SP: Scalar pentadiagonal solver
• LU: Lower-upper factorization Gauss-Seidel solver

See later lecture

Often used in research papers. What is evaluated, under which conditions, and compared to what? Understand the benchmarks

Page 45

Mini Application suite (https://mantevo.org):

• MiniAMR: adaptive mesh refinement
• MiniFE: finite elements
• MiniGhost: 3D halo exchange (ghost cells) for finite differencing
• MiniMD: molecular dynamics
• CloverLeaf: compressible Euler equations
• TeaLeaf: linear heat conduction equation

Page 46

Using the Top500: broad trends in HPC systems architecture

• Very early days: single-processor supercomputers (vector)
• After ’94, all supercomputers are parallel computers
• Earlier days: custom-made, unique – highest performance processor + highest performance network
• Later days, now: custom convergence, weaker standard processors, but more of them, weaker networks (InfiniBand, tori, …)
• Recent years: accelerators (again): GPUs, FPGAs, MIC, …

Much interesting computer history in top500 list; but also much is lost, and many details are not there. See what you can find

Page 47

Example: the Earth Simulator 2002-2004 (#1)

Page 48

June 2002: Earth Simulator

System: Earth-Simulator
Vendor: NEC
Cores: 5120
Rmax: 35,860.00 GFLOPS
Rpeak: 40,960.00 GFLOPS
Power: 3,200.00 kW

• Rmax: performance achieved on HPL
• Rpeak: “theoretical peak performance”, best case, all processors fully busy

Power: Processors only (cooling, storage)?

Page 49

• ~40 TFLOPS
• 5120 vector processors
• 8 (NEC SX-6) processors per node
• 640 nodes, 640x640 full crossbar interconnect

BUT: energy expensive, ~15 MW power supply

Earth Simulator 2 (2009) was the only vector system on the Top500

Page 50

Vector processor operates on long vectors, not only scalars

Peak performance: 8GFlops (all vector pipes active)

256 element (double/long) vectors

Vector architecture pioneered by Cray (Cray-1, 1976; late 1960s, early 1970s). Other vendors: Convex, Fujitsu, NEC, …

Page 51

[Figure: vector unit with vector registers, one SIMD pipeline, and main memory]

• One instruction
• Several deep pipelines can be kept busy by long vector registers: no branches, no pipeline stalls
• Sufficient memory bandwidth to prefetch the next register during vector instruction execution must be available

Page 52

[Figure: the same vector unit, now with 2 pipelines; same bullet points as on the previous slide]

Page 53

[Figure: the same vector unit, now with k SIMD pipelines and banked memory for high vector bandwidth]

• Can sustain several operations per clock cycle over a long interval

Page 54

[Figure: the same vector unit with k SIMD pipelines and banked memory for high vector bandwidth]

HPC: pipelines for the different types of (mostly floating point) operations found in applications (add, mul, divide, √, …; additional special hardware)

Large vector register bank, different register types (index, mask)

Page 55

Prototypical SIMD/data parallel architecture

One (vector) instruction operates on multiple data (long vectors)

G. Blelloch: “Vector Models for Data-Parallel Computing”, MIT Press, 1990

Page 56

int a[n], b[n], c[n];
double x[n], y[n], z[n];
double xx[n], yy[n], zz[n];

for (i=0; i<n; i++) {
  a[i] = b[i]+c[i];
  x[i] = y[i]+z[i];
  xx[i] = (yy[i]*zz[i])/xx[i];
}

Simple “data parallel (SIMD) loop”: n independent (floating point) operations are translated into n/v vector operations. Translates to something like:

for (i=0; i<n; i+=v) {
  vadd(a+i,b+i,c+i);
  vdadd(x+i,y+i,z+i);
  vdmul(t,yy+i,zz+i);
  vddiv(xx+i,t,xx+i);
}

Can keep both integer and floating point pipes busy

n >> v: iteration i can prefetch the vectors for iteration i+v

Page 57

High memory bandwidth by organizing memory into banks (NEC SX-6: 2K banks)

Element i, i+1, i+2, … in different banks, element i and i+2K in same bank: bank conflict, expensive because of serialization

32 Memory units, 64 banks each

Special communication processor (RCU) directly connected to memory system
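A hedged sketch of why certain strides are bad with banked memory (the 2K-bank figure is taken from the slide; the mapping of consecutive elements to consecutive banks is the usual assumption):

#include <stdio.h>

/* Sketch: with 2048 banks and element i stored in bank i % 2048, a stride
 * that is a multiple of 2048 sends every access to the same bank, which
 * serializes the accesses; stride 1 spreads accesses over all banks. */
int main(void)
{
    const int BANKS = 2048;
    const int strides[] = { 1, 7, 2048 };
    for (int s = 0; s < 3; s++) {
        int stride = strides[s];
        printf("stride %4d: elements 0, %d, %d hit banks %d, %d, %d\n",
               stride, stride, 2 * stride,
               0, stride % BANKS, (2 * stride) % BANKS);
    }
    return 0;
}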

Page 58

Vectorizable loop structures

for (i=0; i<n; i++) {

a[i] = b[i]+c[i];

}

for (i=0; i<n; i++) {

a[i] = a[i]+b[i]*c[i];

}

DAXPY, fused multiply add (FMA)

Simple loop, integer (long) and floating point operations

Typically pipelines for • floating point add, multiply, divide; • some integer operations; • daxpy; square root; …

Page 59

Vectorizable loop structures

for (i=0; i<n; i++) {

if (cond[i]) a[i] = b[i]+c[i];

}

Conditional execution handled by masking

for (i=0; i<n; i++) {

R[i] = b[i]+c[i];

MASK[i] = cond[i];

if (MASK[i]) a[i] = R[i];

}

Roughly translates to:

MASK special register for conditional store, R temporary register

Page 60

Vectorizable loop structures

#pragma vdir vector,nodep

for (i=0; i<n; i++) {

a[ixa[i]] = b[ixb[i]]+c[ixc[i]];

}

Gather/scatter operations. The compiler may need help (pragmas).

Can cause bank conflicts, depending on index vector

Memory bandwidth dependent on access pattern

Page 61

Vectorizable loop structures

#pragma vdir vector

for (i=1; i<n; i++) {

a[i] = a[i-1]+a[i];

}

min = a[0];

#pragma vdir vector

for (i=0; i<n; i++) {

if (a[i]<min) min = a[i];

}

Prefix-sums

Min/max operations

With special hardware support

Page 62

#pragma vdir vector,nodep

for (i=0; i<n; i++) {

a[s*i] = b[s*i]+c[s*i];

}

Strided access

Can cause bank conflicts (some strides always bad)

Vectorizable loop structures

Large-vector processors currently out of fashion in HPC, almost non-existent

NEC SX-8 (2005), NEC SX-9 (2008), NEC SX-ACE (2013)

2009-2013: No NEC vector processors (market lost?)

Page 63

NEC SX-Aurora TSUBASA: Vector Engine (ca. 2017)

• 8-core vector processor

• 1.2 TBytes/second memory bandwidth
• Rpeak: 2.45 TFLOPS

Page 64

Many scientific applications fit well with vector model. Irregular, non-numerical applications often not

Mature compiler technology for vectorization and optimization (loop splitting, loop fusion, …). Aim: keep the vector pipes busy

Allen, Kennedy: “Optimizing Compilers for Modern Architectures”, MKP 2002

Scalar (non-vectorizable) code carried out by standard, scalar processor; amount limits performance (Amdahl’s Law)

Vector programming model: Loops, sequential control flow, compiler handles parallelism (implicit) by vectorizing loops (some help from programmer)

Page 65

Small scale vectorization: Standard processors

• MMX, SSE, AVX, AVX2,… (128 bit vectors, 256 bit vectors)

• Intel MIC/Xeon Phi: 512 bit vectors, new, special vector instructions (2013: Compiler support not yet mature; 2016: Much better), AVX-512 (2018: Xeon Phi defunct!)

High performance on standard processors:
• Exploit the vectorization potential
• Check whether loops were indeed vectorized (gcc -ftree-vectorizer-verbose=n …, in combination with architecture-specific optimizations)

• Intrinsics

2, 4, 8 Floating Point operations simultaneously by one vector instruction (no integers?)
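As a hedged illustration of the intrinsics route (a sketch, not from the slides): an AVX version of the simple add loop, processing 4 doubles per 256-bit instruction, with a scalar loop for the remainder.

#include <immintrin.h>

/* Sketch: explicit vectorization of a[i] = b[i] + c[i] with AVX intrinsics.
 * Compile with e.g. gcc -mavx. */
void vec_add(int n, double *a, const double *b, const double *c)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m256d vb = _mm256_loadu_pd(&b[i]);   /* load 4 doubles */
        __m256d vc = _mm256_loadu_pd(&c[i]);
        _mm256_storeu_pd(&a[i], _mm256_add_pd(vb, vc));
    }
    for (; i < n; i++)                          /* scalar remainder */
        a[i] = b[i] + c[i];
}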

Page 66

Support for vectorization since OpenMP 4.0

#pragma omp simd [clauses…]

for (i=0; i<n; i++) {

a[i] = b[i]+c[i];

}

Clauses: reduction (for sums), collapse (for nested loops)
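A hedged sketch (not from the slides) of the reduction clause mentioned above, applied to a dot product:

/* Sketch: dot product vectorized with the OpenMP simd directive and a
 * reduction clause. Compile with e.g. gcc -fopenmp (or -fopenmp-simd). */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}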

Page 67

Explicit parallelism

• 8-way SMP (8 vector processors per shared-memory node)
• Not cache-coherent
• Nodes connected by a full crossbar

2-level explicit parallelism:

• Intra-node with shared-memory communication

• Inter-node with communication over crossbar

Page 68

Coherence

A memory system is coherent if any update (write) to memory by any processor will eventually become visible to any other processor.

[Figure: two cores, each holding a copy of the value x in its L1 cache; shared Lk cache and main memory]

Cache coherence: any update to a value in the cache of some processor will eventually become visible to any other processor (regardless of whether the value is also in the cache of that other processor).

Maintaining cache coherence (across sockets/large multi-cores) can be expensive!

Page 69

Memory behavior, memory model

• Access (read, write) to different locations may take different time (NUMA: memory network, placement of memory controllers, caches, write buffers)

• In which order will updates to different locations by some processor become visible to other processors?

• Memory model specifies: Which accesses can overtake which other accesses

Sequential consistency: Accesses take effect in program order

Most modern processors are not sequentially consistent
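A classic illustration (a sketch, not part of the slides): the store-buffering litmus test. Under sequential consistency at least one thread must observe the other's write, so r1 = r2 = 0 is impossible; on most modern machines it can occasionally be observed because stores sit in store buffers while the loads overtake them.

#include <pthread.h>
#include <stdio.h>

/* Sketch: store-buffering litmus test. Plain accesses are used deliberately
 * to expose hardware reordering (formally a data race, so this is an
 * illustration of memory models, not a programming pattern). */
volatile int x = 0, y = 0;
volatile int r1, r2;

void *thread0(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
void *thread1(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("r1=%d r2=%d (r1=r2=0 would violate sequential consistency)\n", r1, r2);
    return 0;
}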

Page 70

No cache-coherence: Earth Simulator/NEC SX

• The scalar unit of the vector processor has a cache
• Caches of different processors are not coherent
• Vector units read/write directly to memory; no vector caches
• Write-through cache

Different design choice: the Cray X1 (vector computer, early 2000s) had a different, cache-coherent design.

• Nodes must coordinate and synchronize
• The parallel programming model (OpenMP, MPI) helps

D. Abts, S. Scott, D. J. Lilja: “So Many States, So Little Time: Verifying Memory Coherence in the Cray X1”, IPDPS 2003: 11

Page 71

Example: MPI and cache non-coherence

[Figure: processes i and j on the same node. Rank i calls MPI_Send(&x,…,comm); rank j calls MPI_Recv(&y,…,comm,&status). x lies in the memory of rank i, y in the memory of rank j, and a stale copy of y sits in the cache of j. The message is written into j's memory by a vectorized memcpy, leaving cache and memory in an incoherent state.]

Coherency/consistency is needed after MPI_Recv: rank j must invalidate the cache(lines) at the point where MPI requires coherence (at MPI_Recv).
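A minimal, hedged sketch of the exchange in the figure (rank 0 plays the role of i, rank 1 of j); on a non-cache-coherent node the MPI library must ensure that after MPI_Recv returns, rank 1 reads the freshly written y and not a stale cached copy.

#include <mpi.h>
#include <stdio.h>

/* Sketch: rank 0 sends x, rank 1 receives into y. */
int main(int argc, char *argv[])
{
    int rank;
    double x = 42.0, y = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received y = %f\n", y); /* must not be a stale value */
    }

    MPI_Finalize();
    return 0;
}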

Page 72

Example: MPI and cache non-coherence

[Figure: the same situation as on the previous slide]

Coherency/consistency needed after MPI_Recv: a clear_cache instruction invalidates all cache lines.

Expensive: 1) clear_cache itself; 2) all cached values are lost!

Further complication with MPI: structured data/datatypes; the address &y alone does not tell where the data are.

Page 73

Example: OpenMP and cache non-coherence

#pragma omp parallel for

for (i=0; i<n; i++) {

x[i] = f(y[i]);

}

Sequential region: All x[i]’s visible to all threads

OpenMP: All regions (parallel, critical, …) require memory in a consistent state (caches coherent); implicit flush/fence constructs to force visibility (in OpenMP construct)

Lesson: higher-level programming models can help to alleviate the need for low-level, fine-grained cache coherency.

Page 74

Cache coherence debate

• Cache: Beneficial for applications with spatial and/or temporal locality (not all applications have this: Graph algorithms)

• Caches have been a major factor in single-processor performance increases (since sometime in the 1980s)

Many new challenges for caches in parallel processors:
• Coherency
• Scalability
• Resource consumption (logic = transistors = chip area; energy)
• …

Milo M. K. Martin, Mark D. Hill, Daniel J. Sorin: Why on-chip cache coherence is here to stay. Commun. ACM 55(7): 78-89 (2012)

Too expensive?

Page 75

MPI and OpenMP

Still most widely used programming interfaces/models for parallel HPC (there are contenders)

MPI: Message-Passing Interface, see www.mpi-forum.org

• MPI processes (ranks) communicate explicitly: point-to-point communication, one-sided communication, collective communication, parallel I/O
• Subgrouping and encapsulation (communicators)
• Much support functionality

OpenMP: shared-memory interface (C/Fortran pragma-extension), data (loops) and task parallel support, see www.openmp.org

Page 76

Partitioned Global Address Space (PGAS) alternative to MPI

Addressing mechanism: part of the processor-local address space can be shared between processes; referencing non-local parts of the partitioned space leads to implicit communication.

Language- or library-supported: some data structures (typically arrays) can be declared as shared (partitioned) across (all) threads.

Note: PGAS is not the same as Distributed Shared Memory (DSM). PGAS explicitly controls which data structures (arrays) are partitioned, and how they are partitioned.

Page 77

PGAS: data structures (simple arrays) partitioned (shared) over the memory of p threads

[Figure: global array a, divided into blocks; each block of the global array lies in the local memory of some process/thread, with thread k owning one block (a simple, block-cyclic distribution of array a)]
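A small sketch (assuming a 1-D block-cyclic layout with block size B over p threads, as in the figure; the names are made up for illustration) of how a global index maps to an owning thread and a local offset:

/* Sketch: owner and local offset of global index i under a block-cyclic
 * distribution with block size B over p threads. Block i/B is dealt out
 * round-robin to the threads. */
typedef struct { int owner; long local_index; } location;

location locate(long i, long B, int p)
{
    long block = i / B;                        /* block containing element i */
    location loc;
    loc.owner = (int)(block % p);              /* blocks dealt round-robin   */
    loc.local_index = (block / p) * B + i % B; /* offset in owner's memory   */
    return loc;
}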

Page 78

[Figure: the same global array a, with thread k owning one block]

Thread k:
b = a[i];
a[j] = b;

This entails communication if index i or index j is not owned by thread k.

PGAS memory model: defines when an update becomes visible to other threads.

Page 79

[Figure: the same global array a]

Thread k:
a[i] = b[j];

This is allowed even if neither a[i] nor b[j] is owned by k; again, the PGAS memory model defines when the update becomes visible to other threads.

Page 80

[Figure: global array a, partitioned (shared) over the memory of p threads, thread k owning one block]

forall(i=0; i<n; i++) {
  a[i] = f(x[i]);
}

Owner-computes rule: thread k performs updates only on the elements (indices) owned by / local to k.

Page 81

Typical PGAS features:

• Array assignments/operations translated into communication when necessary, based on ownership
• Mostly simple, block-cyclic distributions of (multi-dimensional) arrays
• Collective communication support for redistribution, collective data transfer (transpositions, gather/scatter) and reduction-type operations
• Bulk operations, array operations

Even more extreme: SIMD array languages, with array operations parallelized by library and runtime

Often less support for library building (process subgroups) than MPI

Page 82

Some PGAS languages/interfaces:

• UPC/UPC++: Unified Parallel C, C/C++ language extension; collective communication support; severe limitations

• CaF: Co-array Fortran, standardized, but limited PGAS extension to Fortran

• CAF2: considerably more powerful, non-standardized Fortran extension

• X10 (Habanero): IBM asynchronous PGAS language
• Chapel: Cray, powerful data structure support
• Titanium: Java extension
• Global Arrays (GA): older, PGAS-like library for array programming, see http://hpc.pnl.gov/globalarrays/

• HPF: High-Performance Fortran

Fortran is still an important language in HPC

Page 83

Mattias De Wael, Stefan Marr, Bruno De Fraine, Tom Van Cutsem, Wolfgang De Meuter: Partitioned Global Address Space Languages. ACM Comput. Surv. 47(4): 62:1-62:27 (2015)

Activity, maturity of PGAS languages?

UPC finds some applications

Martina Prugger, Lukas Einkemmer, Alexander Ostermann: Evaluation of the partitioned global address space (PGAS) model for an inviscid Euler solver. Parallel Computing 60: 22-40 (2016)

No new developments for the past decade? Implementation status and performance not discussed. Many PGAS language implementations use MPI as (default) communication layer

Page 84

The Earth Simulator: Interconnect

Full crossbar:
• Each node has a direct link (cable) to each other node
• Full bidirectional communication over each link
• All pairs of nodes can communicate simultaneously without having to share bandwidth
• Processors on a node share the crossbar bandwidth
• Strong: 12.6 GByte/s BW vs. 64 GFLOPS/node; for each Byte communicated, ca. 6 FLOPs of arithmetic intensity are needed in the application, otherwise the processor idles

Page 85

Fully connected network, p nodes: floor(p/2) disjoint pairs can communicate at the same time; in all pairings, all nodes can communicate directly.

Maximum distance between any two nodes (diameter): one link.

[Figure: four nodes, each with a processor P and network interfaces N, pairwise connected]

The fully connected network is realized as an (indirect) crossbar network.

Page 86

Hierarchical/Hybrid communication subsystems

• Processors placed in shared-memory nodes; processors on same node are “closer” than processors on different nodes

• Different communication media within nodes (e.g., shared-memory) and between nodes (e.g., crossbar network)

• Processors on same node share bandwidth of inter-node network

• Compute nodes may have one or more “lanes” (rails) to network(s)

[Figure: four shared-memory nodes, each with memory M and four processors P, attached to a communication network]

Page 87

[Figure: the same four-node system with communication network]

Actually, there are many more hierarchy levels:
• Cache (and memory) hierarchy: L1 (data/instruction) -> L2 -> L3 (…)
• Processors (multi-core) share caches at certain levels (processors may differ, e.g., AMD vs. Intel)
• The network may itself be hierarchical (Clos/fat tree: InfiniBand): nodes, racks, islands, …

Page 88

Hierarchical communication system

Processors can be partitioned (non-trivially) such that:
• Processors in the same partition communicate with roughly the same performance (latency, bandwidth, number of ports, …)
• Processors in different partitions communicate with roughly the same (lower) performance

[Figure: processors grouped into partitions Part 0, Part 1, …, Part k]

Partitions can again be hierarchical.

A crossbar network is not hierarchical (all processors can communicate with the same performance).

Page 89

Programming model and system hierarchy

“Pure”, homogeneous programming models are oblivious to the hierarchy:
• MPI (no performance model, only indirect mechanisms for grouping processes according to the system structure: MPI topologies)
• UPC (local/global, no grouping at all)
• …

Implementation challenge for the compiler/library implementer to take the hierarchy into account:
• Point-to-point communication uses the closest path, e.g., shared memory when possible
• Efficient, hierarchical collective communication algorithms exist (for some cases; still incomplete and immature)

Page 90

“Pure”, homogeneous programming models oblivious to hierarchy

Application programmer relies on language/library to efficiently exploit system hierarchy:

• Portability!
• Performance portability?! If all library/language functions give good performance on (any) given system, then an application whose performance is dominated by library/language functions will perform predictably when ported to another system.

Sensible to analyze performance in terms of collective operations (building blocks), e.g.,

T(n,p) = T_Allreduce(p) + T_Alltoall(n) + T_Bcast(np) + O(n)

Page 91

Hybrid/heterogeneous programming models (“MPI+X”)

• Conscious of certain aspects/levels of the hierarchy
• Possibly more efficient application code
  • Example: MPI+OpenMP
• Less portable, less performance portable
• Sometimes unavoidable (accelerators): OpenCL, OpenMP, OpenACC, …

[Figure: four shared-memory nodes on a communication network; OpenMP inside each node, MPI between the master threads]
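A hedged sketch (not from the slides) of the MPI+OpenMP pattern in the figure: one MPI process per node, OpenMP threads inside the node, MPI calls issued only by the master thread (MPI_THREAD_FUNNELED).

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Sketch: node-local OpenMP reduction, then an inter-node MPI reduction. */
int main(int argc, char *argv[])
{
    int provided, rank;
    double local = 0.0, global = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel reduction(+:local)
    {
        local += omp_get_thread_num() + 1;   /* stand-in for node-local work */
    }

    /* only the master thread communicates between nodes */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}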

Page 92

Earth Simulator 2 / SX-9, 2009

Compared to the SX-6 / Earth Simulator:
• More pipes
• Special pipes (square root)

Peak performance >100GFLOPS/processor

Page 93

Earth Simulator 2 / SX-9 system

Peak performance/CPU:   102.4 GFLOPS     Total number of CPUs:   1280
Peak performance/PN:    819.2 GFLOPS     Total number of PNs:    160
Shared memory/PN:       128 GByte        Total peak performance: 131 TFLOPS
CPUs/PN:                8                Total main memory:      20 TByte

Page 94

Cheaper communication network than full crossbar: Fat-Tree

Page 95

Fat-Tree: Indirect (multi-stage), hierarchical network

[Figure: binary tree of switches N with processors P at the leaves]

Tree network, max 2 log p “hops” between processors, p-1 “wires”

[Figure: the same tree network, but with link capacity increasing towards the root]

Bandwidth increases, “fatter” wires

C. E. Leiserson: Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Trans. Computers 34(10): 892-901, 1985

Thinking Machines CM5 (a fat-tree machine), on the first, unofficial Top500

[Figure: fat-tree realized with many small crossbar switches N arranged in stages]

Realization with N small crossbar switches

Example: InfiniBand

Example: The Blue Genes, 2004 (#1)

November 2004, Blue Gene/L

System: BlueGene/L DD2 beta-System (0.7 GHz PowerPC 440)
Vendor: IBM    Cores: 32768    Rmax: 70720.00 GFLOPS    Rpeak: 91750.00 GFLOPS

Large number of cores (2012: 1,572,864 cores in the Sequoia system), weaker cores, limited memory per core/node

IBM Blue Gene/L
• ~200,000 processing cores
• 256 MBytes to 1 GByte/core

Note: Not possible to locally maintain the state of the whole system: 256 MBytes / 200,000 ≈ 1 KByte per peer process

• Applications that need to maintain state information for each other process are in trouble
• Libraries (e.g., MPI) that need to maintain state information for each process are in (big) trouble

• “slow” processors, 700-800 MHz
• Simpler processors, limited out-of-order execution and branch prediction
• BG/L: 2-core, not cache-coherent
• BG/P: 4-core, cache-coherent
• BG/Q: ?
• Very memory constrained (512 MB to 4 GB/node)
• Simple, low-bisection 3d-torus network

Energy efficient, heavily present on the Green500

[Figure: 4x4 grid of processors P illustrating a torus/mesh interconnect]

Note: A torus is not a hierarchical network

José E. Moreira, Valentina Salapura, George Almási, Charles Archer, Ralph Bellofatto, Peter Bergner, Randy Bickford, Matthias A. Blumrich, José R. Brunheroto, Arthur A. Bright, Michael Brutman, José G. Castaños, Dong Chen, Paul Coteus, Paul Crumley, Sam Ellis, Thomas Engelsiepen, Alan Gara, Mark Giampapa, Tom Gooding, Shawn Hall, Ruud A. Haring, Roger L. Haskin, Philip Heidelberger, Dirk Hoenicke, Todd Inglett, Gerard V. Kopcsay, Derek Lieber, David Limpert, Patrick McCarthy, Mark Megerian, Michael Mundy, Martin Ohmacht, Jeff Parker, Rick A. Rand, Don Reed, Ramendra K. Sahoo, Alda Sanomiya, Richard Shok, Brian E. Smith, Gordon G. Stewart, Todd Takken, Pavlos Vranas, Brian P. Wallenfelt, Michael Blocksome, Joe Ratterman: The Blue Gene/L Supercomputer: A Hardware and Software Story. International Journal of Parallel Programming 35(3): 181-206 (2007)

On the BlueGene/L System

George Almási, Charles Archer, José G. Castaños, John A. Gunnels, C. Christopher Erway, Philip Heidelberger, Xavier Martorell, José E. Moreira, Kurt W. Pinnow, Joe Ratterman, Burkhard D. Steinmacher-Burow, William Gropp, Brian R. Toonen:Design and implementation of message-passing services for the Blue Gene/L supercomputer. IBM Journal of Research and Development 49(2-3): 393-406 (2005)

On MPI for the BlueGene/L System

Example: Road Runner, 2008 (#1)

First PetaFLOP system, seriously accelerated

Decommissioned 31.3.2013

November 2008, Road Runner

System: BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire InfiniBand
Vendor: IBM    Cores: 129600    Rmax: 1105.0 TF    Rpeak: 1456.7 TF    Power: 2483.00 KW

What counts as a “core”?

Early, accelerated system:

• 3240 nodes
• 2 x 2-core AMD processors per node
• 2 IBM Cell Broadband Engines (CBE) per node
• InfiniBand interconnect (single rail, 288-port IB switch)

Highly imbalanced:
Node performance ~600 GFLOPS, communication bandwidth/node only a few GBytes/s

[Figure: Cell Broadband Engine: standard IBM scalar PowerPC architecture, multiple ring network with atomic operations, 25.6 GByte/s links, total BW >300 GByte/s, ~250 GFLOPS total]

SPE (Synergistic Processing Element): 25.6 GFLOPS (32-bit!)

• SIMD (128-bit vectors, 4 x 32-bit words)
• Single-issue, no out-of-order capabilities, limited (no?) branch prediction
• Small local storage, 256 KB, no cache (no coherency)

Complex, heterogeneous system: Complex programming model (?)

• Deeply hierarchical system: SPEs -> PPE -> Multi-core -> InfiniBand
• MPI communication between multi-core nodes, either one process per processor or one process per node
• Possibly OpenMP/shared memory model on the nodes
• Offload of compute-intensive kernels to the CBE
• CBE programming: PPE/SPE, vectorization, explicit communication between SPEs, PPE, node memory

Road Runner requires very (very) compute-intensive applications

Extremely high AI (arithmetic/operational intensity)

MPI communication

• Let the SPEs of the Cell be full-fledged MPI processes
• Offload to the CPUs as needed/possible

Pakin et al.: The reverse-acceleration model for programming petascale hybrid systems. IBM J. Res. and Dev. (5): 8, 2009

Drawbacks:
• High latency (SPE -> PPE -> CPU -> IB)
• Supports only a subset of MPI

[Figure: four nodes (memory M, four processors P each) attached to a communication network by a single link each]

Single rail: One connection to the network, one network

[Figure: the same four nodes, now with multiple links per node to (possibly duplicated) communication networks]

Communication bandwidth can be improved by providing more lanes (rails) to the network, and more duplicates of the network (multi-rail). Network costs increase proportionally

Examples:• VSC-3 (2014)• Summit, Sierra (2018)

Top500 exercise: Which was the first multi-rail system? When?

”Solution” to the exercise (information not in the Top500 list):

• “LeMieux”, Pittsburgh Supercomputing Center, 2001-2006, dual-rail Quadrics, https://www.psc.edu/news-publications/30-years-of-psc (Top500 #2, Nov. 2001)
• “Pleiades”, NASA Ames, some nodes with multi-ported InfiniBand (Top500 #11, Nov. 2008)
• “TSUBAME 2.0”, some nodes with 2x InfiniBand (Top500 #4, Nov. 2010)

Thanks to Anton Görgl, WS 2019

Example: the Fujitsu K Computer, 2011 (#1)

June 2011, K-Computer

System: K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
Vendor: Fujitsu    Cores: 548352    Rmax: 8162.0 TF    Rpeak: 8773.6 TF    Power: 9898.56 KW

• High-end, multithreaded, scalar processor (SPARC64 VIIIfx)
• Many special instructions
• 16 GFLOPS per core (Rpeak/#cores)
• 6-dimensional torus
• Homogeneous, no accelerator

Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Yuzo Takagi, Toshiyuki Shimizu: The Tofu Interconnect. IEEE Micro 32(1): 21-31 (2012)
Yuichiro Ajima, Shinji Sumimoto, Toshiyuki Shimizu: Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers. IEEE Computer 42(11): 36-40 (2009)

Examples: Other accelerator-based systems

November 2013, TianHe-2

System: TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200 GHz, TH Express-2, Intel Xeon Phi 31S1P
Vendor: NUDT    Cores: 3,120,000    Rmax: 33,862.7 TF    Rpeak: 54,902.4 TF    Power: 17,808.00 KW

November 2012, Cray Titan

System: Cray XK7, Opteron 6274 16C 2.200 GHz, Cray Gemini interconnect, NVIDIA K20x
Vendor: Cray    Cores: 560640    Rmax: 17590.0 TF    Rpeak: 27112.5 TF    Power: 8209.00 KW

November 2012, Stampede (#7)

System: PowerEdge C8220, Xeon E5-2680 8C 2.700 GHz, Infiniband FDR, Intel Xeon Phi
Vendor: Dell    Cores: 462462    Rmax: 5,168.1 TF    Rpeak: 8,520.1 TF    Power: 4,510.00 KW

November 2010, Tianhe

System: NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C
Vendor: NUDT    Cores: 186368    Rmax: 2566.0 TF    Rpeak: 4701.0 TF    Power: 4040.00 KW

Hybrid architectures with accelerator support (GPU, MIC)

• High performance and low energy consumption through accelerators

• GPU accelerator: Highly parallel “throughput architecture”, lightweight cores, complex memory hierarchy, banked memory

• MIC accelerator: Lightweight x86 cores, extended vectorization, ring-network on chip

Issues with accelerators: currently (2013) limited on-device memory (MIC: 8 GByte), PCIe connection to the main processor

Programming: Kernel offload, explicitly with OpenCL/CUDA

MIC: Some “reverse acceleration” projects, MPI between MIC cores

Heavily accelerated system, one or more accelerators

[Figure: node with main memory and cache hierarchy (L1 … Lk), SIMD cores, and one or more accelerators Acc with their own accelerator memory]

This will likely change

Although same ISA, a heterogeneous programming model (offloading) may be needed

OpenMP, OpenACC, … (+MPI)

News late 2017: KNL… line discontinued by Intel
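To illustrate the offload style of programming for such accelerators, a minimal sketch of my own using OpenMP target directives (it assumes a compiler with OpenMP 4.x offload support for the device at hand, and falls back to the host otherwise; CUDA/OpenCL versions make the kernel and data movement explicit instead):

#include <stdio.h>

/* Offload a simple vector update to an attached accelerator (sketch) */
void saxpy(int n, float a, const float *x, float *y)
{
  /* Map x and y to the device memory, run the loop there, copy y back */
  #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
}

int main(void)
{
  enum { N = 1000 };
  static float x[N], y[N];
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
  saxpy(N, 3.0f, x, y);
  printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
  return 0;
}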

Accelerators for Exascale?

Energy consumption and cooling are obstacles for reaching ExaFLOPS

Energy is consumed in the
• Processor (heat, leakage)
• Memory system
• Interconnect

“Solution”: Massive amounts of simple, low-frequency processors; weak(er) interconnects; deep memory hierarchy

Run-of-the-mill

VSC-2, June 2011; November 2012: #162

System: Megware Saxonid 6100, Opteron 8C 2.2 GHz, Infiniband QDR
Vendor: Megware    Cores: 20776    Rmax: 152.9 TF    Rpeak: 182.8 TF    Power: 430.00 KW

Similar to the TU Wien, Parallel Computing group “jupiter” cluster

VSC-3: November 2014 #85; November 2015 #138; November 2016 #246; November 2017 #460

System: Oil blade server, Intel Xeon E5-2650v2 8C 2.6 GHz, Intel TrueScale Infiniband
Vendor: ClusterVision    Cores: 32,768    Rmax: 596.0 TF    Rpeak: 681.6 TF    Power: 450.00 KW

• Innovative oil cooling
• Dual-rail InfiniBand

VSC-4 is coming in 2019

Memory in HPC systems (2015)

System                  #Cores      Memory (GB)   Memory/Core (GB)
TianHe-2                3,120,000   1,024,000     0.33
Titan (Cray XK)         560,640     710,144       1.27
Sequoia (IBM BG/Q)      1,572,864   1,572,864     1
K (Fujitsu SPARC)       705,024     1,410,048     2
Stampede (Dell)         462,462     192,192       0.42
Roadrunner (IBM)        129,600     ?
Pleiades (SGI)          51,200      51,200        1
BlueGene/L (IBM)        131,072     32,768        0.25
Earth Simulator (SX9)   1,280       20,480        16
Earth Simulator (SX6)   5,120       ~10,000       1.95

Memory/core in HPC systems

• What is a core (GPU SIMD core)?

• Memory is a scarce resource; not possible to keep state information for all cores

• Hybrid, shared-memory programming models may help to keep shared structures once per node

• Algorithms must use memory efficiently: in-place, no O(n²) representations for O(n+m)-sized graphs, …

Not easily found in the Top500 list

Details on the interconnect are only indirectly available:
• Bandwidth/node, bandwidth/core
• Bisection bandwidth
• Number of communication ports/node

Fully connected, direct: high bisection, low diameter, contention free
(Fat-)tree: logarithmic diameter, high bisection possible, contention possible
Torus/Mesh: low bisection, high diameter

#cores?

Summary: Exploiting (HPC) systems well

• Understand computer architecture: processor capabilities (pipeline, branch predictor, speculation, vectorization, …), memory system (cache hierarchy, memory network)

• Understand communication networks (structure: diameter, bisection width; practical realization: NIC, communication processors)

• Understand the programming model and its realization: language, interface, framework; algorithms and data structures

Co-design: Application, programming model, architecture

An HPC system works best for the applications for which it is targeted (AI); there are always tradeoffs

Summary: What is HPC?

Study of
• Computer architecture, memory systems
• Communication networks
• Programming models and interfaces
• (Parallel) algorithms and data structures, for applications and for interface support
• Assessment of computer systems: performance models, rigorous benchmarking

For Scientific Computing (applications):
• Tools, libraries, packages
• (Parallel) algorithms and data structures

Processor architecture models

Roofline model: this lecture

Hennessy, Patterson: Computer Architecture – A Quantitative Approach (5th ed.). Morgan Kaufmann, 2012

Bryant, O'Hallaron: Computer Systems. Prentice-Hall, 2003

Georg Hager, Jan Treibig, Johannes Habich, Gerhard Wellein: Exploring performance and power properties of modern multi-core chips via simple machine models. Concurrency and Computation: Practice and Experience 28(2): 189-210 (2016)

Georg Hager, Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. Chapman and Hall / CRC computational science series, CRC Press 2011, ISBN 978-1-439-81192-4, pp. I-XXV, 1-330

Memory system

Cache system basics

Georg Hager, Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. Chapman and Hall / CRC computational science series, CRC Press 2011, ISBN 978-1-439-81192-4, pp. I-XXV, 1-330

• Cache-aware algorithm: Algorithm that uses memory (cache) hierarchy efficiently, under knowledge of the number of levels, cache and cache line sizes

• Cache-oblivious algorithm: Algorithm that uses memory hierarchy efficiently, without explicitly knowing cache system parameters (cache and line sizes)

• Cache-replacement strategies (not this year)

Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran: Cache-Oblivious Algorithms. ACM Trans. Algorithms 8(1): 4 (2012), results dating back to FOCS 1999


Memory system: Multi-core memory systems (NUMA)

Georg Hager, Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. Chapman and Hall / CRC computational science series, CRC Press 2011, ISBN 978-1-439-81192-4, pp. I-XXV, 1-330

Memory efficient algorithms: External memory model, in-place algorithms, …

Not this year

Communication networks

• Network topologies
• Routing
• Modeling

Some of this in the MPI part of the lecture, as needed

Communication library

Efficient communication algorithms for the given network assumptions inside MPI

In the MPI part of the lecture

Completely different case-study: Context allocation in MPI

Process i: MPI_Send(&x,c,MPI_INT,j,TAG,comm);

Process j: MPI_Recv(&y,c,MPI_INT,i,TAG,comm,&status);

Process j receives messages with TAG on comm in order

MPI_Send(…,j,TAG,other); does not match: no communication if comm!=other

Implementation of point-to-point communication: the message envelope contains a communication context, unique to comm, to distinguish messages on different communicators
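A self-contained version of the two fragments (my own completion; the tag value, buffer sizes and the ranks i=0, j=1 are assumptions), runnable with at least two processes:

#include <mpi.h>
#include <stdio.h>

#define TAG 42

int main(int argc, char *argv[])
{
  const int c = 4;
  int x[4] = {1, 2, 3, 4}, y[4];
  int rank, i = 0, j = 1;        /* sender and receiver ranks (assumed) */
  MPI_Comm comm;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  comm = MPI_COMM_WORLD;
  MPI_Comm_rank(comm, &rank);

  if (rank == i) {
    MPI_Send(x, c, MPI_INT, j, TAG, comm);           /* matched below */
  } else if (rank == j) {
    MPI_Recv(y, c, MPI_INT, i, TAG, comm, &status);  /* same source, tag, communicator */
    printf("process %d received %d %d %d %d\n", rank, y[0], y[1], y[2], y[3]);
  }

  MPI_Finalize();
  return 0;
}

The receive is matched only because source, tag and communicator all agree; the same send posted on a different communicator would not be matched.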

Implementation challenge: small envelope

Tradeoff: number of bits for the communication context vs. number of possible communication contexts.
Sometimes: 12 bits, 14 bits, 16 bits… (4K to 16K possible communicators)

Recall:
• Communicators in MPI are essential for safe parallel libraries; tags are not sufficient (library routines written by different people might use the same tags)
• Communicators in MPI are essential for algorithms that require collective communication on subsets of processes

[Figure: processes i and j within MPI_COMM_WORLD, each holding a local MPI_Comm handle and calling MPI_Send(…,comm) and MPI_Recv(…,comm,&status), respectively]

MPI_Comm: local structure representing a distributed communicator object

MPI_COMM_WORLD: default communicator, containing all processes

MPI_Comm_create(), MPI_Comm_split(), MPI_Dist_graph_create(), …: collective operations to create new communicators out of old ones

Algorithm scheme, process i:

1. Determine which other processes will belong to the new communicator
2. Allocate a context id: maintain a global bitmap of used ids

Standard implementation:
Use a 4K to 16K bit vector (bitmap) to keep track of free communication contexts. If bitmap[i]==0, then i is a free communication context

unsigned long bitmap[words];

[Figure: MPI_COMM_WORLD split into several new communicators, each process holding its own local MPI_Comm structure for each communicator it belongs to]

Problem: Ensure that all processes in a new communicator get the same communication context by using the same bitmap

Step 2.1:
Since all communicator-creating operations are collective, use a collective MPI_Allreduce() to generate the global bitmap representing all used communication contexts

unsigned long bitmap[words], newmap[words];

MPI_Allreduce(bitmap,newmap,words,MPI_UNSIGNED_LONG,MPI_BOR,comm); /* bitwise OR */

Important fact (will see later in the lecture): For any reasonable network N, it holds that

Time(MPI_Allreduce(m)) = O(max(diam(N), log p) + m)
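Putting steps 2.1-2.3 together, a sketch of the whole allocation (my own code under assumed names; a real MPI library tracks considerably more state and handles the out-of-ids case differently):

#include <mpi.h>

#define WORDS 64                      /* 64 x 64 bits = 4K context ids (assumption) */
static unsigned long bitmap[WORDS];   /* locally known used context ids */

/* Return a context id that is free on all processes of comm, or -1 if none */
int allocate_context(MPI_Comm comm)
{
  unsigned long newmap[WORDS];
  int i, z;

  /* Bitwise OR over all processes: a bit is 0 only if that id is free everywhere */
  MPI_Allreduce(bitmap, newmap, WORDS, MPI_UNSIGNED_LONG, MPI_BOR, comm);

  for (i = 0; i < WORDS; i++) {
    if (newmap[i] != ~0UL) {                  /* word with at least one 0-bit */
      unsigned long x = newmap[i];
      for (z = 0; x & 0x1UL; z++) x >>= 1;    /* rightmost 0-bit in that word */
      bitmap[i] |= 1UL << z;                  /* mark the id as used locally */
      return i * 8 * (int)sizeof(unsigned long) + z;
    }
  }
  return -1;                                  /* no free communication context */
}

Since the Allreduce is collective, every participating process computes the same newmap and therefore picks the same context id, which is exactly the consistency requirement stated above.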

[Plot: typical MPI_Allreduce time as a function of problem size, for a fixed number of processes (p = 26x16), measured on the “jupiter” IB cluster at TU Wien; minimum recorded time, no error bars]

Time is constant for m≤K, for some small K

Use K as the size of the bitmap?

Step 2.2: Find the first word with a 0-bit, O(words) operations

for (i=0; i<words; i++) if (newmap[i]!=0xF…FL) break;

Step 2.3: Find the rightmost (first) 0-bit in that word, O(wordlength) operations; dominates if words<wordlength

unsigned long x = newmap[i];

for (z=0; z<8*sizeof(x); z++)
  if ((x&0x1)==0x0) break; else x>>=1;

64 words of 64 bits = 4K communication contexts

Can we do better?

Find “first 0 from right”, faster methods (here: 16-bit example words)

Method 1: Architecture has an lsb(x) instruction (“least significant bit of x”), O(1) operations

z = lsb(~x);

x:  0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1
~x: 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0

General challenge:
Useful bit operations in O(1) or O(log w) operations for words of length w (note O(log w) = O(log log n), pretty fast)
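On today's compilers such instructions are usually reachable as intrinsics; for instance, GCC and Clang provide __builtin_ctzl() (count trailing zeros) and __builtin_popcountl(), which Method 1 and Method 2 map to directly (a sketch of mine, not part of the lecture code):

#include <limits.h>

/* Position of the first 0-bit from the right in x (Method 1: lsb(~x)) */
static inline int first_zero_from_right(unsigned long x)
{
  if (x == ~0UL) return (int)(CHAR_BIT * sizeof x);  /* no 0-bit at all */
  return __builtin_ctzl(~x);
}

/* The same value as the number of trailing 1-bits (Method 2: pop(x & ~(x+1))) */
static inline int trailing_ones(unsigned long x)
{
  return __builtin_popcountl(x & ~(x + 1UL));
}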

Method 2: Architecture has a “popcount” instruction pop(x) (population count, number of 1s in x), O(1) operations (here: 16-bit word)

x = x&~(x+1);
z = pop(x);

x:         0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1
x+1:       0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0
~(x+1):    1 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1
x&~(x+1):  0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1

z = pop(x) = 7

Method 3: Direct; binary search, O(log w) operations, O(1) for fixed w (code for a 32-bit word, same 16-bit example word)

x: 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1

z = 0;
if ((x&0x0000FFFF) == 0x0000FFFF) { z = 16; x >>= 16; }
if ((x&0x000000FF) == 0x000000FF) { z += 8; x >>= 8; }
if ((x&0x0000000F) == 0x0000000F) { z += 4; x >>= 4; }
if ((x&0x00000003) == 0x00000003) { z += 2; x >>= 2; }
z += (x&0x1);

Used in early mpich (mid-90s)

Worked example (same code): the binary search narrows down to the first 0-bit

x    = 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1    z = 0  (low 16 bits and low 8 bits are not all 1s)
x>>4 = 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 1    z = 4  (low nibble was all 1s)
x>>2 = 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 1    z = 6  (low two bits were all 1s)
finally z += (x&0x1), so z = 7

Method 3a: direct; binary search to find lsb(~x) (code for a 32-bit word); might be better because each mask is needed only once

x: 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1

x = ~x; // invert bits: the first 0-bit of x is the least significant 1-bit of ~x
if (x==0) z = 32; else {
  z = 0;
  if ((x&0x0000FFFF) == 0x0) { z = 16; x >>= 16; }
  if ((x&0x000000FF) == 0x0) { z += 8; x >>= 8; }
  if ((x&0x0000000F) == 0x0) { z += 4; x >>= 4; }
  if ((x&0x00000003) == 0x0) { z += 2; x >>= 2; }
  z += 1-(x&0x1); // if the remaining low bit is 1, the lsb is at z, otherwise at z+1
}

Method 4: implement popcount yourself (code for a 32-bit word); exploits word parallelism and is branch-free

x = x&~(x+1);                              // isolate the trailing 1-bits

x = (x&0x55555555) + ((x>>1)&0x55555555);  // popcount of the result:
x = (x&0x33333333) + ((x>>2)&0x33333333);
x = (x&0x0F0F0F0F) + ((x>>4)&0x0F0F0F0F);
x = (x&0x00FF00FF) + ((x>>8)&0x00FF00FF);
x = (x&0x0000FFFF) + ((x>>16)&0x0000FFFF);
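A small self-check of Method 4 (testing scaffold of my own, not from the lecture): compare the branch-free count of trailing 1-bits against a naive loop for a range of inputs.

#include <stdio.h>
#include <stdint.h>

/* Method 4: isolate the trailing 1-bits, then popcount them */
static int method4(uint32_t x)
{
  x = x & ~(x + 1);
  x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);
  x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);
  x = (x & 0x0000FFFFu) + ((x >> 16) & 0x0000FFFFu);
  return (int)x;
}

/* Naive reference: shift out the trailing 1-bits one by one */
static int naive(uint32_t x)
{
  int z = 0;
  while (x & 0x1u) { x >>= 1; z++; }
  return z;
}

int main(void)
{
  uint32_t v;
  for (v = 0; v < 0x1000000u; v++)   /* 2^24 test values */
    if (method4(v) != naive(v)) { printf("mismatch at %u\n", v); return 1; }
  printf("Method 4 agrees with the naive count on all tested values\n");
  return 0;
}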

Idea: pop(0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1) = pop(0 0 1 0 1 1 0 1) + pop(0 1 1 1 1 1 1 1), …and recurse

Observation: pop(x) for a k-bit word x is at most k, so pop(x) fits in a k-bit word; each line of the code above therefore adds adjacent groups of partial counts in parallel, doubling the group width per line.

Base case on 2-bit groups of x = 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1:
pop(10) = ((10>>1)&0x1) + (10&0x1) = 1
pop(11) = ((11>>1)&0x1) + (11&0x1) = 2

Method 4a: implement popcount, improved (code for a 32-bit word)

x = ~(~x|(x+1));
x = x-((x>>1)&0x55555555);
x = (x&0x33333333) + ((x>>2)&0x33333333);
x = (x+(x>>4)) & 0x0F0F0F0F;
x += (x>>8);
x += (x>>16);
z = x&0x0000003F;

Exercise: Figure out what this does and why it works

Preprocessing for FFT: Bit reversal

Bit-reversal permutation often needed, e.g., for the efficient Fast Fourier Transform (FFT):
B[r(i)] = A[i], where r(i) is the number arising from reversing the bits in the binary representation of i

Examples:
r(111000) = 000111
r(10111) = 11101
r(101101) = 101101

General: r(ab) = r(b)r(a)

r(x) for a 32-bit word x: recursively, in parallel, branch-free:

x = ((x&0x55555555)<<1) | ((x&0xAAAAAAAA)>>1);
x = ((x&0x33333333)<<2) | ((x&0xCCCCCCCC)>>2);
x = ((x&0x0F0F0F0F)<<4) | ((x&0xF0F0F0F0)>>4);
x = ((x&0x00FF00FF)<<8) | ((x&0xFF00FF00)>>8);
x = ((x&0x0000FFFF)<<16) | ((x&0xFFFF0000)>>16);

(Each line swaps adjacent groups: first single bits, then pairs, nibbles, bytes, and finally half-words.)

Note: the assignments can be done in any order

And perhaps even better (reuse of constants), for a 32-bit word:

x = ((x&0x55555555)<<1) | ((x>>1)&0x55555555);
x = ((x&0x33333333)<<2) | ((x>>2)&0x33333333);
x = ((x&0x0F0F0F0F)<<4) | ((x>>4)&0x0F0F0F0F);
x = (x<<24) | ((x&0xFF00)<<8) | ((x>>8)&0xFF00) | (x>>24);
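For convincing oneself that the mask-and-shift steps are right, a naive reference reversal to compare against (my own helper, not from the slides):

#include <stdint.h>

/* Reverse the 32 bits of x one bit at a time (reference implementation) */
static uint32_t reverse_naive(uint32_t x)
{
  uint32_t r = 0;
  int i;
  for (i = 0; i < 32; i++) { r = (r << 1) | (x & 0x1u); x >>= 1; }
  return r;
}

Checking the branch-free variants against this loop on a few million random inputs is a quick sanity test.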

Finding largest/smallest power of 2

Detect whether x is a power of two; find the smallest y = 2^k with y ≥ x

int ispoweroftwo(long x) {
  return (((x-1)&x)==0);
}

Find smallest: lsb(r(x)).

Better, direct method (for a 32-bit word):

x = x-1;
x = x | (x>>1);
x = x | (x>>2);
x = x | (x>>4);
x = x | (x>>8);
x = x | (x>>16);
x++;

Exercise: Find the largest k s.t. 2^k ≤ x (aka msb(x), see lsb(x))

Are such things relevant?

“If you write optimizing compilers or high-performance code, you must read this book”,
Guy L. Steele, Foreword to “Hacker's Delight”, 2002

D. E. Knuth: “The Art of Computer Programming”, Vol. 4, Section 7.1.3, Addison-Wesley, 2011
D. E. Knuth: “MMIXware: A RISC Computer for the Third Millennium”, LNCS 1750, 1999 (new edition 2014)

See also http://graphics.stanford.edu/~seander/bithacks.html

Interesting further reading:

Michael Pippig: PFFT: An Extension of FFTW to Massively Parallel Architectures. SIAM J. Scientific Computing 35(3) (2013)

Matteo Frigo, Steven G. Johnson: FFTW: an adaptive software architecture for the FFT. ICASSP 1998: 1381-1384
Matteo Frigo: A Fast Fourier Transform Compiler. PLDI 1999: 169-180

Kang Su Gatlin, Larry Carter: Memory Hierarchy Considerations for Fast Transpose and Bit-Reversals. HPCA 1999: 33-42

Larry Carter, Kang Su Gatlin: Towards an Optimal Bit-Reversal Permutation Program. FOCS 1998: 544-555

Not the end of the story

MPI_Comm_split(oldcomm,color,key,&newcomm);

All processes (in oldcomm) that supply the same color will belong to the same newcomm, ordered by key, with ties broken by rank in oldcomm (processes supplying MPI_UNDEFINED get no new communicator)

Problem: rank i supplying color c needs to determine which other processes also supplied color c

Trivial solution: all processes gather all colors and keys (MPI_Allgather), sort lexicographically to determine the rank in newcomm

Early mpich (mid-90s): bubblesort!!!
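A sketch of this trivial solution (my own code; it only determines the rank a process would get in its new communicator, ignores MPI_UNDEFINED, and leaves out the context-id allocation discussed before):

#include <mpi.h>
#include <stdlib.h>

/* Rank that the calling process would get in the sub-communicator for 'color',
   ordered by 'key', ties broken by rank in comm */
int split_rank(MPI_Comm comm, int color, int key)
{
  int rank, size, i, newrank = 0;
  int mine[2], *all;

  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  mine[0] = color; mine[1] = key;
  all = (int*)malloc(2 * size * sizeof(int));

  /* Every process gathers every (color,key) pair: O(p) space per process */
  MPI_Allgather(mine, 2, MPI_INT, all, 2, MPI_INT, comm);

  /* New rank = number of processes of my color that order before me */
  for (i = 0; i < size; i++) {
    if (all[2*i] != color) continue;
    if (all[2*i+1] < key || (all[2*i+1] == key && i < rank)) newrank++;
  }
  free(all);
  return newrank;
}

With the gathered pairs in hand, a single O(p) scan (or an O(p log p) sort) suffices; the papers referenced below address exactly the scalability problems of this gather-everything approach.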

Better solutions:
• A different, O(p log p) sort
• A modified allgather algorithm that merges on the fly
• …

Siebert, Wolf: "Parallel Sorting with Minimal Data". EuroMPI 2011, LNCS 6960: 170-177
A. Moody, D. H. Ahn, B. R. de Supinski: Exascale Algorithms for Generalized MPI_Comm_split. EuroMPI 2011, LNCS 6960: 9-18