Parallel Programming Concepts – Shared Nothing Parallelism
Dr. Peter Tröger, M.Sc. Frank Feinbube
Parallel Processing
■ Inside the processor
  □ Instruction-level parallelism (ILP)
  □ Multicore
  □ Shared memory
■ With multiple processing elements in one machine
  □ Multiprocessing
  □ Shared memory
■ With multiple processing elements in many machines
  □ Multicomputer
  □ Shared nothing (in terms of a globally accessible memory)
Clusters
■ Collection of stand-alone machines connected by a local network
  □ Cost-effective technique for a large-scale parallel computer
  □ Low cost of both hardware and software
  □ Users are builders, have control over their own system (hardware and software), low costs as major issue
■ Distributed processing as extension of DM-MIMD
  □ Communication orders of magnitude slower than with SM
  □ Only feasible for coarse-grained parallel activities (see the MPI sketch below)
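Message passing is the programming model that matches this shared-nothing setup: each node works on its own data, and only small, explicit messages cross the slow interconnect. A minimal MPI sketch (assuming a standard MPI installation; the workload is made up for illustration):

/* Coarse-grained shared-nothing parallelism: every process computes a
 * partial result on its own slice of the work and only one small value
 * per process travels over the network. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* id of this process/node   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

    /* Local computation on a private part of the iteration space. */
    double local = 0.0;
    for (long i = rank; i < 10000000L; i += size)
        local += 1.0 / (double)(i + 1);

    /* One small message per process instead of fine-grained sharing. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", total);

    MPI_Finalize();
    return 0;
}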
Clusters

[Figure: load-balanced web server cluster; requests from the Internet pass through a load balancer to multiple web servers]
History of Clusters
■ 1977: ARCnet (Datapoint)
  □ LAN protocol, DATABUS programming language
  □ Single computer with terminals
  □ Addition of 'compute resource' and 'data resource' computers transparent for the application
■ May 1983: VAXCluster (DEC)
  □ Cluster of VAX computers, no single point of failure
  □ Everything duplicated
  □ High-speed messaging interconnect
  □ Distributed version of the VMS operating system
  □ Distributed lock manager for shared resources
History of Clusters - NOW
■ Berkeley Network Of Workstations (NOW), 1995
■ Building large-scale parallel computing systems with COTS hardware
■ GLUnix operating system
  □ Transparent remote execution, network PIDs
  □ Load balancing
  □ Virtual Node Numbers (for communication)
■ Network RAM: idle machines as paging device
■ Collection of low-latency, parallel communication primitives: 'active messages'
■ Berkeley sockets, shared address space parallel C, MPI
Cluster System Classes
■ High-availability (HA) clusters – improvement of cluster availability
  □ Linux-HA project (multi-protocol heartbeat, resource grouping)
■ Load-balancing clusters – server farm for increased performance / availability
  □ Linux Virtual Server (IP load & application-level balancing)
■ High-performance computing (HPC) clusters – increased performance by splitting tasks among different nodes
  □ Speed up the computation of one distributed job (FLOPS)
■ High-throughput computing (HTC) clusters – maximize the number of finished jobs
  □ All kinds of simulations, especially parameter sweep
  □ Special case: idle time computing for cycle harvesting
Massively Parallel Processing (MPP)
■ Hierarchical SIMD / MIMD architecture with a lot of processors
  □ Still standard components (in contrast to mainframes)
  □ Specialized setup of these components
  □ Host nodes responsible for loading program and data to PEs
  □ High-performance interconnect (bus, ring, 2D mesh, hypercube, tree, ...)
  □ For embarrassingly parallel applications, mostly simulations (atom bomb, climate, earthquake, airplane, car crash, ...)
■ Examples
  □ Distributed Array Processor (1979), 64x64 single-bit PEs
  □ BlueGene/L (2007), 106,496 nodes x 2 PowerPC (700 MHz)
  □ IBM Sequoia (2012), 16.3 PFLOPS, 1.6 PB memory, 98,304 compute nodes, 1.6 million cores, 7,890 kW power
Blue Gene/L
1.1 View from the outside
The Blue Gene/P system has the familiar, slanted profile that was introduced with the Blue Gene/L system. However, the increased compute power requires an increase in airflow, resulting in a larger footprint. Each of the air plenums on the Blue Gene/P system is just over ten inches wider than the plenums of the previous model. Additionally, each Blue Gene/P rack is approximately four inches wider. There are two additional bulk power modules mounted in the bulk power enclosure on the top of the rack. Rather than a circuit-breaker style switch, there is an on/off toggle switch to power on the machine.
1.1.1 Packaging
Figure 1-1 illustrates the packaging of the Blue Gene/L system.
Figure 1-1 Blue Gene/L packaging
Figure 1-2 on page 3 shows how the Blue Gene/P system is packaged. The changes start at the lowest point of the chain: each chip contains four processors rather than the two of the Blue Gene/L system.
At the next level, only one chip sits on each compute (processor) card. This design is easier to maintain and produces less waste: on the Blue Gene/L system, replacing a compute node because of a single failed processor meant discarding one usable chip, because two chips are packaged per card. The Blue Gene/P design has only one chip per processor card, eliminating the disposal of a good chip when a compute card is replaced.
Each node card still has 32 chips, but now the maximum number of I/O nodes per node card is two, so that only two Ethernet ports are on the front of each node card. Like the Blue Gene/L system, there are two midplanes per rack. The lower midplane is considered to be the
Blue Gene/L packaging hierarchy (Figure 1-1):
■ Chip: 2 processors; 2.8/5.6 GF/s; 4 MB
■ Compute card: 2 chips (1x2x1); 5.6/11.2 GF/s; 1.0 GB
■ Node card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2); 90/180 GF/s; 16 GB
■ Rack: 32 node cards; 2.8/5.6 TF/s; 512 GB
■ System: 64 racks (64x32x32); 180/360 TF/s; 32 TB
Blue Gene/P
Blue Gene/P packaging hierarchy:
■ Chip: 4 processors; 13.6 GF/s; 8 MB EDRAM
■ Compute card: 1 chip, 20 DRAMs; 13.6 GF/s; 2.0 GB DDR2 (4.0 GB as of 6/30/08)
■ Node card: 32 compute cards, 0-1 I/O cards (32 chips, 4x4x2); 435 GF/s; 64 (128) GB
■ Rack: 32 node cards, cabled 8x8x16; 13.9 TF/s; 2 (4) TB
■ System: 72 racks (72x32x32); 1 PF/s; 144 (288) TB
Blue Gene/Q
Blue Gene/Q packaging hierarchy:
1. Chip: 16+2 PowerPC cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
■ Sustained single-node performance: 10x P, 20x L
■ MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
■ Software and hardware support for programming models for exploitation of node hardware concurrency
Blue Gene/Q
BlueGene/Q compute chip
■ System-on-a-Chip design: integrates processors, memory, and networking logic into a single chip
■ 360 mm² Cu-45 technology (SOI)
■ 16 user + 1 service PPC processors, plus 1 redundant processor
  □ All processors are symmetric
  □ 11 metal layers
  □ Each 4-way multi-threaded, 64 bit, 1.6 GHz
  □ L1 I/D cache = 16 kB / 16 kB, L1 prefetch engines
  □ Each processor has a Quad FPU (4-wide double precision, SIMD)
  □ Peak performance 204.8 GFLOPS @ 55 W
■ Central shared L2 cache: 32 MB eDRAM
  □ Multiversioned cache; supports transactional memory and speculative execution
  □ Supports scalable atomic operations
■ Dual memory controller
  □ 16 GB external DDR3 memory
  □ 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3, 2 channels, each with chipkill protection)
■ Chip-to-chip networking
  □ 5D torus topology + external link: 5 x 2 + 1 high-speed serial links
  □ Each 2 GB/s send + 2 GB/s receive
  □ DMA, remote put/get, collective operations
■ External (file) I/O, when used as an I/O chip
  □ PCIe Gen2 x8 interface (4 GB/s Tx + 4 GB/s Rx)
  □ Re-uses 2 serial links
  □ Interface to Ethernet or InfiniBand cards
Blue Gene/Q
Blue Gene/Q System Architecture
[Figure: Blue Gene/Q system architecture; a service node (MMCS, DB2, LoadLeveler, system console) controls I/O nodes (Linux, ciod) and compute nodes (CNK) via 1 Gb control Ethernet, FPGA and JTAG; compute nodes are connected through the torus and collective networks; I/O nodes reach front-end nodes and file servers (fs client) over a 10 Gb QDR functional network; applications run on the compute nodes]
Blue Gene/Q
BG/Q Software Stack Openness
■ New open source community under CPL license, with active IBM participation
■ Existing open source communities under various licenses; BG code will be contributed and/or new sub-communities started
■ New open source reference implementations licensed under CPL
■ Closed components: no source provided, not buildable
[Figure: software stack across compute/I/O nodes and service/login nodes. Application layer: ESSL, MPI, Charm++, Global Arrays, MPI-IO, XL and GNU runtimes, PAMI (converged messaging stack). System layer: Compute Node Kernel (CNK), CIO services, Linux kernel on I/O nodes, node and messaging SPIs, node firmware (init, bootloader, RAS, recovery, mailbox). Firmware/hardware layer: low-level control system (power on/off, hardware probe and init, parallel monitoring, parallel boot, mailbox) on service and node cards. Service/login nodes: high-level control system (MMCS) for partitioning, job management and monitoring, RAS and administrator interface, plus DB2, LoadLeveler, GPFS, runjob and scheduler APIs, BGMon, BG Nav, BGWS, BG master, TEAL, diagnostics harness, XL and GNU compilers, HPC Toolkit, ISV schedulers and debuggers (totalviewd)]
MPP Properties
■ Standard components (processors, hard disks, ...)
■ Specific non-standardized interconnection network
  □ Low latency, high speed; distributed file system
■ Specific packaging of components and nodes for cooling and upgradeability
  □ Whole system provided by one vendor (IBM, HP)
  □ Extensibility as major issue, in order to save investment
  □ Distributed processing as extension of DM-MIMD
■ Scalability only limited by the application, not by the hardware
■ Proprietary wiring of standard components
■ Demands custom operating system and aligned applications
■ No major consideration of availability
■ Power consumption and cooling as major issues
Distributed System
■ Tanenbaum (Distributed Operating Systems): „A distributed system is a collection of independent computers that appear to the users of the system as a single computer.“
■ Coulouris et al.: „... [system] in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages.“
■ Lamport: „A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.”
■ Consequences: concurrency, no global clock, independent failures
■ Challenges: heterogeneity, openness, security, scalability, failure handling, concurrency, need for transparency
SMP vs. Cluster vs. Distributed System
■ Clusters are composed of computers, SMPs are composed of processors
  □ High availability is cheaper with clusters, but demands additional software components
  □ Scalability is easier with a cluster
  □ SMPs are easier to maintain from the administrator's point of view
  □ Software licensing becomes more expensive with a cluster
■ Clusters for capability computing, integrated machines for capacity computing
■ Cluster vs. Distributed System
  □ Both consist of multiple nodes for parallel processing
  □ Nodes in a distributed system have their own identity
  □ Physical vs. virtual organization
Comparison
■ Number of nodes: MPP O(100)-O(1000); SMP O(10)-O(100); Cluster O(100) or less; Distributed O(10)-O(1000)
■ Node complexity: MPP fine grain; SMP medium or coarse grain; Cluster medium grain; Distributed wide range
■ Internode communication: MPP message passing / shared variables (SM); SMP centralized and distributed shared memory; Cluster message passing; Distributed shared files, RPC, message passing, IPC
■ Job scheduling: MPP single run queue on host; SMP single run queue mostly; Cluster multiple queues but coordinated; Distributed independent queues
■ SSI support: MPP partially; SMP always in SMP; Cluster desired; Distributed no
■ Address space: MPP multiple; SMP single; Cluster multiple or single; Distributed multiple
■ Internode security: MPP irrelevant; SMP irrelevant; Cluster required if exposed; Distributed required
■ Ownership: MPP one organization; SMP one organization; Cluster one or many organizations; Distributed many organizations
K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming; WCB/McGraw-Hill, 1998
Interconnection Networks
■ Shared-nothing systems demand structured connectivity
  □ Processor-to-processor interaction
  □ Processor-to-memory interaction
■ Static network
  □ Point-to-point links, fixed route
■ Dynamic network
  □ Consists of links and switching elements
  □ Flexible configuration of processor interaction
Interconnection Networks
Interconnection Networks
■ Dynamic networks are built from a graph of configurable switching elements
■ A general packet-switching network counts as an irregular static network
[Peter Newman]
Interconnection Networks
■ Network interfaces
  □ Processors talk to the network via network interface controller (NIC) hardware
  □ Network interfaces attached to the interconnect
    ◊ Cluster vs. tightly-coupled multi-computer
  □ Next-generation hardware will include the NIC in the processor die
■ Switching elements map a fixed number of inputs to outputs
  □ The total number of ports is the degree of the switch
  □ The cost of a switch grows as the square of the degree
  □ The peripheral hardware grows linearly with the degree
Interconnection Networks
■ A variety of network topologies proposed and implemented
■ Each topology has a performance / cost tradeoff
■ Commercial machines often implement hybrids
  □ Optimize packaging and costs
■ Metrics for an interconnection network graph (worked example below)
  □ Diameter: maximum distance between any two nodes
  □ Connectivity: minimum number of edges that must be removed to get two independent graphs
  □ Link width / weight: transfer capacity of an edge
  □ Bisection width: minimum transfer capacity between any two halves of the graph
  □ Costs: number of edges in the network
■ Often optimization for the connectivity metric
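For illustration, a small sketch (assuming p processing elements and p a power of two) that prints the standard closed-form values of these metrics for a ring and a hypercube:

/* Closed-form interconnect metrics for a ring and a hypercube with p nodes
 * (p assumed to be a power of two). Illustrative values only. */
#include <stdio.h>
#include <math.h>

int main(void) {
    int p = 64;
    int d = (int)lround(log2((double)p));     /* hypercube dimension */

    /* ring: diameter p/2, connectivity 2, bisection width 2, p links */
    printf("ring:      diameter %d, connectivity 2, bisection 2, links %d\n",
           p / 2, p);

    /* hypercube: diameter log p, connectivity log p,
     * bisection width p/2, (p * log p) / 2 links */
    printf("hypercube: diameter %d, connectivity %d, bisection %d, links %d\n",
           d, d, p / 2, p * d / 2);
    return 0;
}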
Bus Systems
■ Static interconnect technology
■ Shared communication path, broadcasting of information
  □ Diameter: O(1)
  □ Connectivity: O(1)
  □ Bisection width: O(1)
  □ Costs: O(p)
Crossbar Switch
■ Dynamic switch-based network
■ Non-blocking: supports multiple connections without collisions
■ Diameter: O(1)
■ Connectivity: O(1)
■ Bisection width: O(n)
■ Costs: O(n²)
  □ High costs with quadratic growth, bad scalability
■ n * (n-1) connection points
Crossbar switch (Kreuzschienenverteiler)
• Arbitrary number of permutations
• Collision-free data exchange
• High cost, quadratic growth
• n * (n-1) connection points

Delta networks
• Only n/2 * log n delta switches
• Limited cost
• Not all possible permutations operational in parallel
Crossbar Switch
Multistage Interconnection Networks
■ Connection by switching elements
■ Typical solution to connect processing and memory elements
■ Can implement sorting or shuffling in the network routing
Omega Network
■ Inputs are crossed or not, depending on routing logic
  □ Destination-tag routing: use the positional bit for the switch decision (routing sketch below)
  □ XOR-tag routing: use the positional bit of the XOR result for the decision
■ For N PEs: N/2 switches per stage, log2(N) stages
■ Decreases bottleneck probability for parallel communication
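A small sketch of destination-tag routing (assumptions: N a power of two, 2x2 switches, perfect-shuffle wiring before each stage; the example values are made up):

/* Destination-tag routing in an Omega network: at stage i the switch
 * inspects bit (log2(N)-1-i) of the destination address; 0 selects the
 * upper (even) output, 1 the lower (odd) output. The wiring into each
 * stage is a perfect shuffle (left rotation of the line number). */
#include <stdio.h>

static unsigned shuffle(unsigned line, unsigned bits) {
    unsigned msb = (line >> (bits - 1)) & 1u;          /* left rotate */
    return ((line << 1) | msb) & ((1u << bits) - 1u);
}

static void route(unsigned src, unsigned dst, unsigned bits) {
    unsigned pos = src;
    printf("route %u -> %u:", src, dst);
    for (unsigned stage = 0; stage < bits; stage++) {
        pos = shuffle(pos, bits);                      /* shuffle wiring */
        unsigned bit = (dst >> (bits - 1 - stage)) & 1u;
        pos = (pos & ~1u) | bit;                       /* switch setting */
        printf(" stage %u -> line %u", stage, pos);
    }
    printf("\n");                                      /* pos == dst here */
}

int main(void) {
    route(5, 2, 3);   /* 8 PEs: log2(8) = 3 stages, 4 switches per stage */
    route(0, 7, 3);
    return 0;
}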
Delta Networks
■ Stage n checks bit k of the destination tag
■ Only (n/2 * log n) delta switches needed
■ Limited cost
■ Not all possible permutations operational in parallel
■ Possible effect of 'output port contention' and 'path contention'
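■ Example: for n = 64 ports, a crossbar needs 64 * 63 = 4032 connection points, while a delta network needs only 64/2 * log2(64) = 192 switches, at the price of not supporting all permutations in parallel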
Clos Coupling – Delta Networks and Crossbar

Clos coupling networks
• Combination of delta network and crossbar
C. Clos, "A Study of Non-Blocking Switching Networks," Bell System Technical Journal, vol. 32, no. 2, 1953, pp. 406-424

Fat-tree networks
• PEs arranged as leaves on a binary tree
• Capacity of the tree (links) doubles on each layer
Bitonic Mergesort
Completely Connected / Star Connected Networks
Cartesian Topology Network
Linear Arrays
2D and 3D Meshes
Cartesian Topology Network
■ Linear array: each node has two neighbours
■ 1D torus / ring: linear array with connected endings
■ 2D torus / mesh: each node has four neighbours
■ d-dimensional mesh: nodes with 2d neighbours
■ Hypercube
  □ d-dimensional mesh where d = log n (n = number of processors)
  □ Construction of a hypercube from lower-dimensional hypercubes
Point-to-point networks: ring and fully connected graph
• Ring has only two connections per PE (almost optimal)
• Fully connected graph: optimal connectivity (but high cost)

Mesh and Torus
• Compromise between cost and connectivity
[Figure: 4-way 2D mesh, 4-way 2D torus, 8-way 2D mesh]
Hypercubes
Cubic Mesh
• PEs are arranged in a cubic fashion
• Each PE has 6 links to neighbors

Hypercube
• Dimensions 0-4, recursive definition
Hypercubes
■ Diameter: at most log(n)
■ Each node has log(n) neighbours
■ Distance: number of bit positions differing between the nodes (see the sketch below)
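A small sketch of these properties (assuming nodes are numbered 0 to 2^d - 1): the neighbours of a node are obtained by flipping one address bit per dimension, and the routing distance is the Hamming distance of the two addresses.

/* Hypercube neighbours and hop distance for d = 4 (16 nodes). */
#include <stdio.h>

static int hamming(unsigned a, unsigned b) {
    int dist = 0;
    for (unsigned x = a ^ b; x; x >>= 1)
        dist += x & 1u;                    /* count differing bit positions */
    return dist;
}

int main(void) {
    unsigned d = 4, node = 5;              /* node 0101 in a 4-cube */
    printf("neighbours of %u:", node);
    for (unsigned k = 0; k < d; k++)
        printf(" %u", node ^ (1u << k));   /* flip one bit per dimension */
    printf("\ndistance 5 -> 10: %d hops\n", hamming(5, 10));
    return 0;
}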
Fat Trees
■ Tree structure
  □ The distance between any two nodes is no more than 2 log p
  □ Links higher up potentially carry more traffic; bottleneck at the root node
  □ Can be laid out in 2D with no wire crossings
■ Fat tree
  □ Fattens the links as we go up the tree (sketch below)
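A minimal sketch of the fattening rule (assuming unit-capacity leaf links whose capacity doubles on every level): the aggregate bandwidth per level then stays constant, so the root is no longer the bottleneck it is in a plain tree.

/* Ideal binary fat tree with p leaf PEs: the number of links halves and the
 * per-link capacity doubles on each level, keeping the aggregate constant. */
#include <stdio.h>

int main(void) {
    int p = 16;                            /* number of PEs at the leaves */
    for (int links = p, cap = 1; links >= 2; links /= 2, cap *= 2)
        printf("%2d links x capacity %2d = aggregate %2d\n",
               links, cap, links * cap);
    return 0;
}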
Systolic Arrays
Scalable Coherent Interface
• ANSI / IEEE standard for NUMA interconnect, used in the HPC world
• 64-bit global address space, translation by the SCI bus adapter (I/O window)
• Used as 2D / 3D torus
[Figure: two nodes, each with processors, caches and memory, coupled via SCI cache and SCI bridge]
Experimental Approaches

Systolic Arrays
• Data flow architectures
• Problem: common clock – maximum signal path restricted by frequency
• Fault contention: a single faulty processing element will break the entire machine
■ Data flow architecture
■ Common clock
  □ Maximum signal path restricted by frequency
■ Single faulty element breaks the complete array
Comparison
Static networks: Diameter; Bisection width; Arc connectivity; Cost (no. of links)
■ Completely-connected: 1; p²/4; p-1; p(p-1)/2
■ Star: 2; 1; 1; p-1
■ Complete binary tree: 2 log((p+1)/2); 1; 1; p-1
■ Linear array: p-1; 1; 1; p-1
■ 2-D mesh, no wraparound: 2(√p - 1); √p; 2; 2(p - √p)
■ 2-D wraparound mesh: 2⌊√p/2⌋; 2√p; 4; 2p
■ Hypercube: log p; p/2; log p; (p log p)/2
■ Wraparound k-ary d-cube: d⌊k/2⌋; 2k^(d-1); 2d; dp
Comparison
Dynamic networks: Diameter; Bisection width; Arc connectivity; Cost (no. of links)
■ Crossbar: 1; p; 1; p²
■ Omega network: log p; p/2; 2; p/2
■ Dynamic tree: 2 log p; 1; 2; p-1
Example: Cray T3E
Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.
Example: SGI Origin 3000
Architecture of the SGI Origin 3000 family of servers.
Example: Sun HPC Systems
Architecture of the Sun Enterprise family of servers.
Example: Blue Gene/Q 5D Torus