Parallel Programming Concepts – Shared Nothing Parallelism
Dr. Peter Tröger, M.Sc. Frank Feinbube
Parallel Processing
■ Inside the processor
  □ Instruction-level parallelism (ILP)
  □ Multicore
  □ Shared memory
■ With multiple processing elements in one machine
  □ Multiprocessing
  □ Shared memory
■ With multiple processing elements in many machines
  □ Multicomputer
  □ Shared nothing (in terms of a globally accessible memory)
Clusters
■ Collection of stand-alone machines connected by a local network
  □ Cost-effective technique for a large-scale parallel computer
  □ Low cost of both hardware and software
  □ Users are builders, have control over their own system (hardware and software), low costs as major issue
■ Distributed processing as extension of DM-MIMD
  □ Communication orders of magnitude slower than with SM
  □ Only feasible for coarse-grained parallel activities (see the MPI sketch below)
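Message passing is the programming model that matches this shared-nothing setup: each node works on its own data, and only small, explicit messages cross the slow interconnect. A minimal MPI sketch (assuming a standard MPI installation; the workload is made up for illustration):

/* Coarse-grained shared-nothing parallelism: every process computes a
 * partial result on its own slice of the work and only one small value
 * per process travels over the network. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* id of this process/node   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

    /* Local computation on a private part of the iteration space. */
    double local = 0.0;
    for (long i = rank; i < 10000000L; i += size)
        local += 1.0 / (double)(i + 1);

    /* One small message per process instead of fine-grained sharing. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", total);

    MPI_Finalize();
    return 0;
}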
Clusters

[Figure: load-balanced web server cluster; requests from the Internet pass through a load balancer to multiple web servers]
History of Clusters
■ 1977: ARCnet (Datapoint)
  □ LAN protocol, DATABUS programming language
  □ Single computer with terminals
  □ Addition of 'compute resource' and 'data resource' computers transparent for the application
■ May 1983: VAXCluster (DEC)
  □ Cluster of VAX computers, no single point of failure
  □ Everything duplicated
  □ High-speed messaging interconnect
  □ Distributed version of the VMS operating system
  □ Distributed lock manager for shared resources
History of Clusters - NOW
■ Berkeley Network Of Workstations (NOW), 1995
■ Building large-scale parallel computing systems with COTS hardware
■ GLUnix operating system
  □ Transparent remote execution, network PIDs
  □ Load balancing
  □ Virtual Node Numbers (for communication)
■ Network RAM: idle machines as paging device
■ Collection of low-latency, parallel communication primitives: 'active messages'
■ Berkeley sockets, shared address space parallel C, MPI
Cluster System Classes
■ High-availability (HA) clusters – improvement of cluster availability
  □ Linux-HA project (multi-protocol heartbeat, resource grouping)
■ Load-balancing clusters – server farm for increased performance / availability
  □ Linux Virtual Server (IP load & application-level balancing)
■ High-performance computing (HPC) clusters – increased performance by splitting tasks among different nodes
  □ Speed up the computation of one distributed job (FLOPS)
■ High-throughput computing (HTC) clusters – maximize the number of finished jobs
  □ All kinds of simulations, especially parameter sweep
  □ Special case: idle time computing for cycle harvesting
Massively Parallel Processing (MPP)
■ Hierarchical SIMD / MIMD architecture with a lot of processors
  □ Still standard components (in contrast to mainframes)
  □ Specialized setup of these components
  □ Host nodes responsible for loading program and data to PEs
  □ High-performance interconnect (bus, ring, 2D mesh, hypercube, tree, ...)
  □ For embarrassingly parallel applications, mostly simulations (atom bomb, climate, earthquake, airplane, car crash, ...)
■ Examples
  □ Distributed Array Processor (1979), 64x64 single-bit PEs
  □ BlueGene/L (2007), 106,496 nodes x 2 PowerPC (700 MHz)
  □ IBM Sequoia (2012), 16.3 PFLOPS, 1.6 PB memory, 98,304 compute nodes, 1.6 million cores, 7,890 kW power
Blue Gene/L
1.1 View from the outside
The Blue Gene/P system has the familiar, slanted profile that was introduced with the Blue Gene/L system. However, the increased compute power requires an increase in airflow, resulting in a larger footprint. Each of the air plenums on the Blue Gene/P system is just over ten inches wider than the plenums of the previous model. Additionally, each Blue Gene/P rack is approximately four inches wider. There are two additional bulk power modules mounted in the bulk power enclosure on the top of the rack. Rather than a circuit-breaker style switch, there is an on/off toggle switch to power on the machine.
1.1.1 Packaging
Figure 1-1 illustrates the packaging of the Blue Gene/L system.
Figure 1-1 Blue Gene/L packaging
Figure 1-2 on page 3 shows how the Blue Gene/P system is packaged. The changes start at the lowest point of the chain: each chip contains four processors rather than the two of the Blue Gene/L system.
At the next level, only one chip sits on each compute (processor) card. This design is easier to maintain and produces less waste: on the Blue Gene/L system, replacing a compute node because of a single failed processor meant discarding one usable chip, because two chips are packaged per card. The Blue Gene/P design has only one chip per processor card, eliminating the disposal of a good chip when a compute card is replaced.
Each node card still has 32 chips, but now the maximum number of I/O nodes per node card is two, so that only two Ethernet ports are on the front of each node card. Like the Blue Gene/L system, there are two midplanes per rack. The lower midplane is considered to be the
Blue Gene/L packaging hierarchy (Figure 1-1):
■ Chip: 2 processors; 2.8/5.6 GF/s; 4 MB
■ Compute card: 2 chips (1x2x1); 5.6/11.2 GF/s; 1.0 GB
■ Node card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2); 90/180 GF/s; 16 GB
■ Rack: 32 node cards; 2.8/5.6 TF/s; 512 GB
■ System: 64 racks (64x32x32); 180/360 TF/s; 32 TB
Blue Gene/P
Blue Gene/P packaging hierarchy:
■ Chip: 4 processors; 13.6 GF/s; 8 MB EDRAM
■ Compute card: 1 chip, 20 DRAMs; 13.6 GF/s; 2.0 GB DDR2 (4.0 GB as of 6/30/08)
■ Node card: 32 compute cards, 0-1 I/O cards (32 chips, 4x4x2); 435 GF/s; 64 (128) GB
■ Rack: 32 node cards, cabled 8x8x16; 13.9 TF/s; 2 (4) TB
■ System: 72 racks (72x32x32); 1 PF/s; 144 (288) TB
Blue Gene/Q
Blue Gene/Q packaging hierarchy:
1. Chip: 16+2 PowerPC cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
■ Sustained single-node performance: 10x P, 20x L
■ MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
■ Software and hardware support for programming models for exploitation of node hardware concurrency
Blue Gene/Q
BlueGene/Q compute chip
■ System-on-a-Chip design: integrates processors, memory, and networking logic into a single chip
■ 360 mm² Cu-45 technology (SOI)
■ 16 user + 1 service PPC processors, plus 1 redundant processor
  □ All processors are symmetric
  □ 11 metal layers
  □ Each 4-way multi-threaded, 64 bit, 1.6 GHz
  □ L1 I/D cache = 16 kB / 16 kB, L1 prefetch engines
  □ Each processor has a Quad FPU (4-wide double precision, SIMD)
  □ Peak performance 204.8 GFLOPS @ 55 W
■ Central shared L2 cache: 32 MB eDRAM
  □ Multiversioned cache; supports transactional memory and speculative execution
  □ Supports scalable atomic operations
■ Dual memory controller
  □ 16 GB external DDR3 memory
  □ 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3, 2 channels, each with chipkill protection)
■ Chip-to-chip networking
  □ 5D torus topology + external link: 5 x 2 + 1 high-speed serial links
  □ Each 2 GB/s send + 2 GB/s receive
  □ DMA, remote put/get, collective operations
■ External (file) I/O, when used as an I/O chip
  □ PCIe Gen2 x8 interface (4 GB/s Tx + 4 GB/s Rx)
  □ Re-uses 2 serial links
  □ Interface to Ethernet or InfiniBand cards
Blue Gene/Q
Blue Gene/Q System Architecture
[Figure: Blue Gene/Q system architecture; a service node (MMCS, DB2, LoadLeveler, system console) controls I/O nodes (Linux, ciod) and compute nodes (CNK) via 1 Gb control Ethernet, FPGA and JTAG; compute nodes are connected through the torus and collective networks; I/O nodes reach front-end nodes and file servers (fs client) over a 10 Gb QDR functional network; applications run on the compute nodes]
Blue Gene/Q
BG/Q Software Stack Openness
■ New open source community under CPL license, with active IBM participation
■ Existing open source communities under various licenses; BG code will be contributed and/or new sub-communities started
■ New open source reference implementations licensed under CPL
■ Closed components: no source provided, not buildable
[Figure: software stack across compute/I/O nodes and service/login nodes. Application layer: ESSL, MPI, Charm++, Global Arrays, MPI-IO, XL and GNU runtimes, PAMI (converged messaging stack). System layer: Compute Node Kernel (CNK), CIO services, Linux kernel on I/O nodes, node and messaging SPIs, node firmware (init, bootloader, RAS, recovery, mailbox). Firmware/hardware layer: low-level control system (power on/off, hardware probe and init, parallel monitoring, parallel boot, mailbox) on service and node cards. Service/login nodes: high-level control system (MMCS) for partitioning, job management and monitoring, RAS and administrator interface, plus DB2, LoadLeveler, GPFS, runjob and scheduler APIs, BGMon, BG Nav, BGWS, BG master, TEAL, diagnostics harness, XL and GNU compilers, HPC Toolkit, ISV schedulers and debuggers (totalviewd)]
MPP Properties
■ Standard components (processors, hard disks, ...)
■ Specific non-standardized interconnection network
  □ Low latency, high speed; distributed file system
■ Specific packaging of components and nodes for cooling and upgradeability
  □ Whole system provided by one vendor (IBM, HP)
  □ Extensibility as major issue, in order to save investment
  □ Distributed processing as extension of DM-MIMD
■ Scalability only limited by the application, not by the hardware
■ Proprietary wiring of standard components
■ Demands custom operating system and aligned applications
■ No major consideration of availability
■ Power consumption and cooling as major issues
Distributed System
■ Tanenbaum (Distributed Operating Systems): „A distributed system is a collection of independent computers that appear to the users of the system as a single computer.“
■ Coulouris et al.: „... [system] in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages.“
■ Lamport: „A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.”
■ Consequences: concurrency, no global clock, independent failures
■ Challenges: heterogeneity, openness, security, scalability, failure handling, concurrency, need for transparency
SMP vs. Cluster vs. Distributed System
■ Clusters are composed of computers, SMPs are composed of processors
  □ High availability is cheaper with clusters, but demands additional software components
  □ Scalability is easier with a cluster
  □ SMPs are easier to maintain from the administrator's point of view
  □ Software licensing becomes more expensive with a cluster
■ Clusters for capability computing, integrated machines for capacity computing
■ Cluster vs. Distributed System
  □ Both consist of multiple nodes for parallel processing
  □ Nodes in a distributed system have their own identity
  □ Physical vs. virtual organization
Comparison
■ Number of nodes: MPP O(100)-O(1000); SMP O(10)-O(100); Cluster O(100) or less; Distributed O(10)-O(1000)
■ Node complexity: MPP fine grain; SMP medium or coarse grain; Cluster medium grain; Distributed wide range
■ Internode communication: MPP message passing / shared variables (SM); SMP centralized and distributed shared memory; Cluster message passing; Distributed shared files, RPC, message passing, IPC
■ Job scheduling: MPP single run queue on host; SMP single run queue mostly; Cluster multiple queues but coordinated; Distributed independent queues
■ SSI support: MPP partially; SMP always in SMP; Cluster desired; Distributed no
■ Address space: MPP multiple; SMP single; Cluster multiple or single; Distributed multiple
■ Internode security: MPP irrelevant; SMP irrelevant; Cluster required if exposed; Distributed required
■ Ownership: MPP one organization; SMP one organization; Cluster one or many organizations; Distributed many organizations
K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming; WCB/McGraw-Hill, 1998
Interconnection Networks
■ Shared-nothing systems demand structured connectivity
  □ Processor-to-processor interaction
  □ Processor-to-memory interaction
■ Static network
  □ Point-to-point links, fixed route
■ Dynamic network
  □ Consists of links and switching elements
  □ Flexible configuration of processor interaction
Interconnection Networks
Interconnection Networks
■ Dynamic networks are built from a graph of configurable switching elements
■ A general packet-switching network counts as an irregular static network
[Peter Newman]
Interconnection Networks
■ Network interfaces
  □ Processors talk to the network via network interface controller (NIC) hardware
  □ Network interfaces attached to the interconnect
    ◊ Cluster vs. tightly-coupled multi-computer
  □ Next-generation hardware will include the NIC in the processor die
■ Switching elements map a fixed number of inputs to outputs
  □ The total number of ports is the degree of the switch
  □ The cost of a switch grows as the square of the degree
  □ The peripheral hardware grows linearly with the degree
Interconnection Networks
■ A variety of network topologies proposed and implemented
■ Each topology has a performance / cost tradeoff
■ Commercial machines often implement hybrids
  □ Optimize packaging and costs
■ Metrics for an interconnection network graph (worked example below)
  □ Diameter: maximum distance between any two nodes
  □ Connectivity: minimum number of edges that must be removed to get two independent graphs
  □ Link width / weight: transfer capacity of an edge
  □ Bisection width: minimum transfer capacity between any two halves of the graph
  □ Costs: number of edges in the network
■ Often optimization for the connectivity metric
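For illustration, a small sketch (assuming p processing elements and p a power of two) that prints the standard closed-form values of these metrics for a ring and a hypercube:

/* Closed-form interconnect metrics for a ring and a hypercube with p nodes
 * (p assumed to be a power of two). Illustrative values only. */
#include <stdio.h>
#include <math.h>

int main(void) {
    int p = 64;
    int d = (int)lround(log2((double)p));     /* hypercube dimension */

    /* ring: diameter p/2, connectivity 2, bisection width 2, p links */
    printf("ring:      diameter %d, connectivity 2, bisection 2, links %d\n",
           p / 2, p);

    /* hypercube: diameter log p, connectivity log p,
     * bisection width p/2, (p * log p) / 2 links */
    printf("hypercube: diameter %d, connectivity %d, bisection %d, links %d\n",
           d, d, p / 2, p * d / 2);
    return 0;
}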
Bus Systems
■ Static interconnect technology
■ Shared communication path, broadcasting of information
  □ Diameter: O(1)
  □ Connectivity: O(1)
  □ Bisection width: O(1)
  □ Costs: O(p)
Crossbar Switch
■ Dynamic switch-based network
■ Non-blocking: supports multiple connections without collisions
■ Diameter: O(1)
■ Connectivity: O(1)
■ Bisection width: O(n)
■ Costs: O(n²)
  □ High costs with quadratic growth, bad scalability
■ n * (n-1) connection points
Crossbar switch (Kreuzschienenverteiler)
• Arbitrary number of permutations
• Collision-free data exchange
• High cost, quadratic growth
• n * (n-1) connection points

Delta networks
• Only n/2 * log n delta switches
• Limited cost
• Not all possible permutations operational in parallel
Crossbar Switch
Multistage Interconnection Networks
■ Connection by switching elements
■ Typical solution to connect processing and memory elements
■ Can implement sorting or shuffling in the network routing
Omega Network
■ Inputs are crossed or not, depending on routing logic
  □ Destination-tag routing: use the positional bit for the switch decision (routing sketch below)
  □ XOR-tag routing: use the positional bit of the XOR result for the decision
■ For N PEs: N/2 switches per stage, log2(N) stages
■ Decreases bottleneck probability for parallel communication
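A small sketch of destination-tag routing (assumptions: N a power of two, 2x2 switches, perfect-shuffle wiring before each stage; the example values are made up):

/* Destination-tag routing in an Omega network: at stage i the switch
 * inspects bit (log2(N)-1-i) of the destination address; 0 selects the
 * upper (even) output, 1 the lower (odd) output. The wiring into each
 * stage is a perfect shuffle (left rotation of the line number). */
#include <stdio.h>

static unsigned shuffle(unsigned line, unsigned bits) {
    unsigned msb = (line >> (bits - 1)) & 1u;          /* left rotate */
    return ((line << 1) | msb) & ((1u << bits) - 1u);
}

static void route(unsigned src, unsigned dst, unsigned bits) {
    unsigned pos = src;
    printf("route %u -> %u:", src, dst);
    for (unsigned stage = 0; stage < bits; stage++) {
        pos = shuffle(pos, bits);                      /* shuffle wiring */
        unsigned bit = (dst >> (bits - 1 - stage)) & 1u;
        pos = (pos & ~1u) | bit;                       /* switch setting */
        printf(" stage %u -> line %u", stage, pos);
    }
    printf("\n");                                      /* pos == dst here */
}

int main(void) {
    route(5, 2, 3);   /* 8 PEs: log2(8) = 3 stages, 4 switches per stage */
    route(0, 7, 3);
    return 0;
}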
Delta Networks
■ Stage n checks bit k of the destination tag
■ Only (n/2 * log n) delta switches needed
■ Limited cost
■ Not all possible permutations operational in parallel
■ Possible effect of 'output port contention' and 'path contention'
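■ Example: for n = 64 ports, a crossbar needs 64 * 63 = 4032 connection points, while a delta network needs only 64/2 * log2(64) = 192 switches, at the price of not supporting all permutations in parallel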
Clos Coupling – Delta Networks and Crossbar

Clos coupling networks
• Combination of delta network and crossbar
C. Clos, "A Study of Non-Blocking Switching Networks," Bell System Technical Journal, vol. 32, no. 2, 1953, pp. 406-424

Fat-tree networks
• PEs arranged as leaves on a binary tree
• Capacity of the tree (links) doubles on each layer
Bitonic Mergesort
Completely Connected / Star Connected Networks
Cartesian Topology Network
Linear Arrays
2D and 3D Meshes
Cartesian Topology Network
■ Linear array: each node has two neighbours
■ 1D torus / ring: linear array with connected endings
■ 2D torus / mesh: each node has four neighbours
■ d-dimensional mesh: nodes with 2d neighbours
■ Hypercube
  □ d-dimensional mesh where d = log n (n = number of processors)
  □ Construction of a hypercube from lower-dimensional hypercubes
Point-to-point networks: ring and fully connected graph
• Ring has only two connections per PE (almost optimal)
• Fully connected graph: optimal connectivity (but high cost)

Mesh and Torus
• Compromise between cost and connectivity
[Figure: 4-way 2D mesh, 4-way 2D torus, 8-way 2D mesh]
Hypercubes
Cubic Mesh
• PEs are arranged in a cubic fashion
• Each PE has 6 links to neighbors

Hypercube
• Dimensions 0-4, recursive definition
Hypercubes
■ Diameter: at most log(n)
■ Each node has log(n) neighbours
■ Distance: number of bit positions differing between the nodes (see the sketch below)
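A small sketch of these properties (assuming nodes are numbered 0 to 2^d - 1): the neighbours of a node are obtained by flipping one address bit per dimension, and the routing distance is the Hamming distance of the two addresses.

/* Hypercube neighbours and hop distance for d = 4 (16 nodes). */
#include <stdio.h>

static int hamming(unsigned a, unsigned b) {
    int dist = 0;
    for (unsigned x = a ^ b; x; x >>= 1)
        dist += x & 1u;                    /* count differing bit positions */
    return dist;
}

int main(void) {
    unsigned d = 4, node = 5;              /* node 0101 in a 4-cube */
    printf("neighbours of %u:", node);
    for (unsigned k = 0; k < d; k++)
        printf(" %u", node ^ (1u << k));   /* flip one bit per dimension */
    printf("\ndistance 5 -> 10: %d hops\n", hamming(5, 10));
    return 0;
}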
Fat Trees
■ Tree structure
  □ The distance between any two nodes is no more than 2 log p
  □ Links higher up potentially carry more traffic; bottleneck at the root node
  □ Can be laid out in 2D with no wire crossings
■ Fat tree
  □ Fattens the links as we go up the tree (sketch below)
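A minimal sketch of the fattening rule (assuming unit-capacity leaf links whose capacity doubles on every level): the aggregate bandwidth per level then stays constant, so the root is no longer the bottleneck it is in a plain tree.

/* Ideal binary fat tree with p leaf PEs: the number of links halves and the
 * per-link capacity doubles on each level, keeping the aggregate constant. */
#include <stdio.h>

int main(void) {
    int p = 16;                            /* number of PEs at the leaves */
    for (int links = p, cap = 1; links >= 2; links /= 2, cap *= 2)
        printf("%2d links x capacity %2d = aggregate %2d\n",
               links, cap, links * cap);
    return 0;
}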
Systolic Arrays
Scalable Coherent Interface
• ANSI / IEEE standard for NUMA interconnect, used in the HPC world
• 64-bit global address space, translation by the SCI bus adapter (I/O window)
• Used as 2D / 3D torus
[Figure: two nodes, each with processors, caches and memory, coupled via SCI cache and SCI bridge]
Experimental Approaches

Systolic Arrays
• Data flow architectures
• Problem: common clock – maximum signal path restricted by frequency
• Fault contention: a single faulty processing element will break the entire machine
■ Data flow architecture
■ Common clock
  □ Maximum signal path restricted by frequency
■ Single faulty element breaks the complete array
Comparison
Static networks: Diameter; Bisection width; Arc connectivity; Cost (no. of links)
■ Completely-connected: 1; p²/4; p-1; p(p-1)/2
■ Star: 2; 1; 1; p-1
■ Complete binary tree: 2 log((p+1)/2); 1; 1; p-1
■ Linear array: p-1; 1; 1; p-1
■ 2-D mesh, no wraparound: 2(√p - 1); √p; 2; 2(p - √p)
■ 2-D wraparound mesh: 2⌊√p/2⌋; 2√p; 4; 2p
■ Hypercube: log p; p/2; log p; (p log p)/2
■ Wraparound k-ary d-cube: d⌊k/2⌋; 2k^(d-1); 2d; dp
Comparison
Dynamic networks: Diameter; Bisection width; Arc connectivity; Cost (no. of links)
■ Crossbar: 1; p; 1; p²
■ Omega network: log p; p/2; 2; p/2
■ Dynamic tree: 2 log p; 1; 2; p-1
Example: Cray T3E
Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.
Example: SGI Origin 3000
Architecture of the SGI Origin 3000 family of servers.
Example: Sun HPC Systems
Architecture of the Sun Enterprise family of servers.
Example: Blue Gene/Q 5D Torus