
1

Cluster Computing

Cheng-Zhong Xu

2

Outline

Cluster Computing Basics
– Multicore architecture

– Cluster Interconnect

Parallel Programming for Performance
MapReduce Programming
Systems Management

3

What’s a Cluster?

Broadly, a group of networked autonomous computers that work together to form a single machine in many respects:

– To improve performance (speed)

– To improve throughput

– To improve service availability (high-availability clusters)

Built from commodity off-the-shelf components, the system is often more cost-effective than a single machine of comparable speed or availability

4

Highly Scalable Clusters

High-Performance Cluster (aka Compute Cluster)
– A form of parallel computer that aims to solve problems faster by using multiple compute nodes

– For parallel efficiency, the nodes are often closely coupled by a high-throughput, low-latency network

Server Cluster and Datacenter
– Aims to improve the system's throughput, service availability, power consumption, etc. by using multiple nodes

5

Top500 Installation of Supercomputers

Top500.com

6

Clusters in Top500

7

An Example of Top500 Submission (F’08)

Location: Tukwila, WA

Hardware – Machines: 256 dual-CPU, quad-core Intel 5320 Clovertown 1.86 GHz, 8 GB RAM each

Hardware – Networking: Private & public: Broadcom GigE; MPI: Cisco InfiniBand SDR, 34 IB switches in a leaf/node configuration

Number of Compute Nodes: 256

Total Number of Cores: 2048

Total Memory: 2 TB of RAM

Particulars of the current Linpack runs

Best Linpack Result: 11.75 TFLOPS

Best Cluster Efficiency: 77.1%

For comparison…

Linpack rating from the June 2007 Top500 run (#106) on the same hardware: 8.99 TFLOPS

Cluster efficiency from the June 2007 Top500 run (#106) on the same hardware: 59%

Typical Top500 efficiency for Clovertown motherboards with IB, regardless of operating system: 65-77% (2 instances of 79%)

30% improvement in efficiency on the same hardware; about one hour to deploy

8

Beowulf Cluster
A cluster of inexpensive PCs for low-cost personal supercomputing
Based on commodity off-the-shelf components:

– PC computers running a Unix-like OS (BSD, Linux, or OpenSolaris)

– Interconnected by an Ethernet LAN

Head node, plus a group of compute nodes
– Head node controls the cluster, and serves files to the compute nodes

Standard, free and open-source software
– Programming in MPI
– MapReduce
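As a concrete illustration of the programming model, here is a minimal MPI program of the kind typically run on a Beowulf cluster. It is a sketch only; the build and launch commands in the comment assume a common MPI distribution such as MPICH or Open MPI.

/* Minimal MPI example. With MPICH or Open MPI, roughly:
 *   mpicc -o hello hello.c
 *   mpirun -np 8 ./hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                    /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */
    MPI_Get_processor_name(name, &name_len);   /* which compute node we are on */

    printf("Hello from rank %d of %d on node %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}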

9

Why Clustering Today

Powerful nodes (CPU, memory, storage)
– Today's PC is yesterday's supercomputer

– Multi-core processors

High-speed networks
– Gigabit Ethernet (56% of the Top500 as of Nov 2008)

– InfiniBand system area network (SAN) (24.6%)

Standard tools for parallel/distributed computing and their growing popularity
– MPI, PBS, etc.

– MapReduce for data-intensive computing

10

Major Issues in Cluster Design
Programmability

– Sequential vs Parallel Programming

– MPI, DSM, DSA: hybrid of multithreading and MPI

– MapReduce

Cluster-aware resource management
– Job scheduling (e.g. PBS)

– Load balancing, data locality, communication optimization, etc.

System management
– Remote installation, monitoring, diagnosis

– Failure management, power management, etc.

11

Cluster Architecture

Multi-core node architecture
Cluster interconnect

12

Single-core computer

13

Single-core CPU chip

[Figure: single-core CPU chip with the single core highlighted.]

14

Multicore Architecture

Combines two or more independent cores (normally CPUs) into a single package

Supports multitasking and multithreading within a single physical package

15

Multicore is Everywhere

Dual-core commonplace in laptops
Quad-core in desktops
Dual quad-core in servers
All major chip manufacturers produce multicore CPUs
– Sun Niagara (8 cores, 64 concurrent threads)
– Intel Xeon (6 cores)
– AMD Opteron (4 cores)

16

Multithreading on multi-core

David Geer, IEEE Computer, 2007

17

Interaction with the OS

OS perceives each core as a separate processor

OS scheduler maps threads/processes to different cores

Most major OSes support multi-core today: Windows, Linux, Mac OS X, …
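To make the OS/core interaction concrete, the sketch below pins a thread to a particular core on Linux using the GNU-specific affinity calls. The chosen core number is illustrative, and in practice the default scheduler placement is usually sufficient.

/* Pin a thread to one of the cores the OS exposes as separate processors.
 * Linux-specific (GNU extensions); build with: gcc -pthread pin.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    printf("worker is running on core %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_attr_t attr;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(1, &set);                   /* core 1, assuming at least two cores exist */

    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);   /* GNU extension */

    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}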

18

Cluster Interconnect

Network fabric connecting the compute nodes
Objective is to strike a balance between

– Processing power of compute nodes

– Communication ability of the interconnect

A more specialized LAN, providing many opportunities for performance optimization

– Switch in the core

– Latency vs bandwidth

[Figure: internal organization of a switch — input ports with receivers, input buffers, a cross-bar, output buffers, and transmitters to the output ports, plus control logic for routing and scheduling.]

19

Goal: Bandwidth and Latency

[Figure: two plots — (a) latency vs. delivered bandwidth and (b) delivered bandwidth vs. offered bandwidth; both curves degrade sharply once the network reaches saturation.]

20

Ethernet Switch: allows multiple simultaneous transmissions

hosts have dedicated, direct connections to the switch

switches buffer packets

Ethernet protocol used on each incoming link, but no collisions; full duplex

– each link is its own collision domain

switching: A-to-A' and B-to-B' simultaneously, without collisions

– not possible with a dumb hub

[Figure: six hosts A, B, C, A', B', C' each attached to one interface of a switch with six interfaces (1-6).]

21

Switch Table

Q: how does the switch know that A' is reachable via interface 4, and B' via interface 5?

A: each switch has a switch table; each entry:

– (MAC address of host, interface to reach host, time stamp)

looks like a routing table!
Q: how are entries created and maintained in the switch table?
– something like a routing protocol?

[Figure: the same six-host, six-interface switch as before.]

22

Switch: self-learning

switch learns which hosts can be reached through which interfaces

– when a frame is received, the switch "learns" the location of the sender: the incoming LAN segment

– records the sender/location pair in the switch table

[Figure: A sends a frame with source A and destination A'; the switch, whose table was initially empty, records the entry (MAC addr A, interface 1, TTL 60).]

23

Self-learning, forwarding: example

[Figure: A sends a frame to A'. The switch records (A, interface 1, TTL 60). The destination A' is unknown, so the frame is flooded on all other interfaces. When A' replies, the location of A is already known, so the reply is sent selectively out interface 1, and the switch adds the entry (A', interface 4, TTL 60).]
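The self-learning and forwarding behavior above can be summarized in a short sketch; the table size, TTL value, and MAC-address strings are illustrative, not taken from any real switch implementation.

/* Sketch of a self-learning switch table: learn on receive, then
 * selectively forward if the destination is known, otherwise flood. */
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 1024
#define ENTRY_TTL  60          /* seconds, as in the example table */
#define FLOOD      (-1)

struct entry {
    char mac[18];              /* MAC address as a string */
    int  iface;                /* interface the host was last seen on */
    int  ttl;                  /* remaining lifetime of the entry */
    int  used;
};

static struct entry table[TABLE_SIZE];

/* Learn: record (sender MAC, incoming interface, TTL) */
static void learn(const char *src_mac, int in_iface) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].used && strcmp(table[i].mac, src_mac) == 0) {
            table[i].iface = in_iface;          /* refresh existing entry */
            table[i].ttl = ENTRY_TTL;
            return;
        }
    }
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].used) {
            strcpy(table[i].mac, src_mac);
            table[i].iface = in_iface;
            table[i].ttl = ENTRY_TTL;
            table[i].used = 1;
            return;
        }
    }
}

/* Forward: return the outgoing interface, or FLOOD if unknown */
static int lookup(const char *dst_mac) {
    for (int i = 0; i < TABLE_SIZE; i++)
        if (table[i].used && strcmp(table[i].mac, dst_mac) == 0)
            return table[i].iface;
    return FLOOD;
}

int main(void) {
    /* Frame from A (interface 1) to A': destination unknown, so flood */
    learn("aa:00:00:00:00:0a", 1);
    printf("to A': %d\n", lookup("aa:00:00:00:00:a1"));   /* -1 => flood */

    /* Reply from A' arrives on interface 4: now both hosts are known */
    learn("aa:00:00:00:00:a1", 4);
    printf("to A : %d\n", lookup("aa:00:00:00:00:0a"));   /* 1 => selective send */
    printf("to A': %d\n", lookup("aa:00:00:00:00:a1"));   /* 4 => selective send */
    return 0;
}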

24

Interconnecting switches

Switches can be connected together

Q: when sending from A to G, how does S1 know to forward a frame destined to G via S4 and S3?

A: self-learning! (works exactly the same as in the single-switch case)

Q: what about latency and bandwidth for a large-scale network?

[Figure: hosts A-I attached to four interconnected switches S1-S4.]

25

What characterizes a network?

Topology (what)
– physical interconnection structure of the network graph
– regular vs irregular

Routing Algorithm (which)
– restricts the set of paths that messages may follow
– table-driven, or routing-algorithm based

Switching Strategy (how)
– how the data in a message traverses a route
– store-and-forward vs cut-through

Flow Control Mechanism (when)
– when a message or portions of it traverse a route
– what happens when traffic is encountered?

The interplay of all of these determines performance

26

Tree: An Example

Diameter and average distance are logarithmic
– k-ary tree, height d = log_k N
– address specified as a d-vector of radix-k coordinates describing the path down from the root

Fixed degree
Route up to the common ancestor and then down

– R = B xor A
– let i be the position of the most significant 1 in R; route up i+1 levels
– go down in the direction given by the low i+1 bits of B

Bandwidth and bisection bandwidth?
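A small sketch of this routing rule for the binary (k = 2) case follows; leaf addresses are d-bit strings giving the path down from the root, and the function name and example addresses are illustrative.

/* Tree routing in a binary tree of height d: go up to the common
 * ancestor, then down following the low bits of the destination. */
#include <stdio.h>

static void tree_route(unsigned a, unsigned b, int d) {
    unsigned r = a ^ b;                      /* R = B xor A */
    if (r == 0) { printf("A == B, no hops\n"); return; }

    /* i = position of the most significant 1 in R (bit 0 = least significant) */
    int i = 0;
    for (int bit = d - 1; bit >= 0; bit--)
        if (r & (1u << bit)) { i = bit; break; }

    printf("up %d level(s) to the common ancestor\n", i + 1);

    /* go down, following bits i, i-1, ..., 0 of B */
    for (int bit = i; bit >= 0; bit--)
        printf("down: %s\n", ((b >> bit) & 1u) ? "right" : "left");
}

int main(void) {
    /* Example: 8-leaf tree (d = 3), route from leaf 010 to leaf 110 */
    tree_route(0x2, 0x6, 3);
    return 0;
}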

27

Bandwidth

Bandwidth
– Point-to-point bandwidth

– Bisectional bandwidth of the interconnect fabric: the rate of data that can be sent across an imaginary line dividing the cluster into two halves, each with an equal number of nodes

For a switch with N ports,
– If it is non-blocking, the bisectional bandwidth = N * the point-to-point bandwidth

– An oversubscribed switch delivers less bisectional bandwidth than a non-blocking one, but is cost-effective. It scales the bandwidth per node up to a point, after which increasing the number of nodes decreases the available bandwidth per node

– Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth
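As a worked example with made-up numbers: an edge switch with 48 host-facing ports and only 4 uplinks of the same speed is oversubscribed 12:1 in the way the ratio is usually quoted, so each host sees roughly 1/12 of its link bandwidth when all hosts send across the bisection at once.

/* Illustrative numbers only: 48 x 1 Gbps host links, 4 x 1 Gbps uplinks. */
#include <stdio.h>

int main(void) {
    double host_ports = 48, uplinks = 4, link_gbps = 1.0;

    double offered   = host_ports * link_gbps;   /* worst case: 48 Gbps toward the core */
    double available = uplinks    * link_gbps;   /* only 4 Gbps of uplink capacity */

    printf("oversubscription = %.0f:1\n", offered / available);        /* 12:1 */
    printf("per-host bandwidth under full load = %.2f Gbps\n",
           available / host_ports);                                    /* ~0.08 Gbps */
    return 0;
}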

28

How to Maintain Constant BW per Node?

Limited ports in a single switch
– Multiple switches

The link between a pair of switches can become a bottleneck
– Fast uplinks

How to organize multiple switches
– Irregular topology

– Regular topologies: ease of management

29

Scalable Interconnect: Examples

[Figure: a 16-node butterfly network built from 2x2 switch building blocks, and a fat tree.]

30

Multidimensional Meshes and Tori

d-dimensional array
– n = k_{d-1} x ... x k_0 nodes
– described by a d-vector of coordinates (i_{d-1}, ..., i_0)

d-dimensional k-ary mesh: N = k^d
– k = N^{1/d}
– described by a d-vector of radix-k coordinates

d-dimensional k-ary torus (or k-ary d-cube)?

[Figure: examples of a 2D mesh, a 2D torus, and a 3D cube.]
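A sketch of the node numbering implied by these definitions: converting a node id to its d-vector of radix-k coordinates and back, plus the wrap-around neighbor that distinguishes a torus from a mesh. The dimension count, radix, and helper names are illustrative.

/* Node numbering in a d-dimensional k-ary mesh/torus (N = k^d). */
#include <stdio.h>

#define D 3        /* dimensions */
#define K 4        /* radix: 4-ary, so N = 4^3 = 64 nodes */

/* node id -> coordinate vector (i_0 varies fastest) */
static void to_coords(int node, int coord[D]) {
    for (int dim = 0; dim < D; dim++) {
        coord[dim] = node % K;
        node /= K;
    }
}

/* coordinate vector -> node id */
static int to_node(const int coord[D]) {
    int node = 0;
    for (int dim = D - 1; dim >= 0; dim--)
        node = node * K + coord[dim];
    return node;
}

/* neighbor one step along dimension `dim`; the modulo gives the torus
 * wrap-around link that a plain mesh does not have */
static int torus_neighbor(int node, int dim, int step) {
    int coord[D];
    to_coords(node, coord);
    coord[dim] = (coord[dim] + step + K) % K;
    return to_node(coord);
}

int main(void) {
    int c[D];
    to_coords(37, c);
    printf("node 37 = (%d, %d, %d)\n", c[2], c[1], c[0]);   /* radix-4 digits of 37 */
    printf("+x neighbor of 37: %d\n", torus_neighbor(37, 0, +1));
    return 0;
}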

31

Packet Switching Strategies
Store and Forward (SF)

– move the entire packet one hop toward the destination
– buffer until the next hop is permitted

Virtual Cut-Through and Wormhole
– pipeline the hops: the switch examines the header, decides where to send the message, and then starts forwarding it immediately

– Virtual Cut-Through: buffer on blockage
– Wormhole: leave the message spread through the network on blockage

32

SF vs WH (VCT) Switching

Unloaded latency: h(n/b + Δ) for store-and-forward vs. n/b + hΔ for cut-through
– h: distance (number of hops)
– n: size of the message
– b: bandwidth
– Δ: additional routing delay per hop

[Figure: time diagrams of store & forward routing vs. cut-through routing — with store-and-forward each switch waits for the entire packet before forwarding it to the next hop, while with cut-through the hops overlap in a pipeline from source to destination.]
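A worked example of the two unloaded-latency formulas above, with made-up values for h, n, b, and the per-hop routing delay:

/* Compare SF and cut-through unloaded latencies for sample parameters. */
#include <stdio.h>

int main(void) {
    double h     = 3;        /* hops between source and destination */
    double n     = 4096;     /* message size in bytes */
    double b     = 1e9;      /* link bandwidth, bytes/s */
    double delta = 1e-6;     /* routing delay per hop, seconds */

    double t_sf = h * (n / b + delta);   /* store-and-forward: whole packet per hop */
    double t_ct = n / b + h * delta;     /* cut-through/wormhole: hops are pipelined */

    printf("store-and-forward: %.2f us\n", t_sf * 1e6);   /* ~15.29 us */
    printf("cut-through:       %.2f us\n", t_ct * 1e6);   /* ~7.10 us  */
    return 0;
}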

33

Conventional Datacenter Network

34

Problems with the Architecture

Resource fragmentation:
– If an application grows and requires more servers, it cannot use available servers in other layer-2 domains, resulting in fragmentation and underutilization of resources

Poor server-to-server connectivity
– Servers in different layer-2 domains must communicate through the layer-3 portion of the network

See the papers in the reading list on Datacenter Network Design for proposed approaches

35

Parallel Programming for Performance

36

Steps in Creating a Parallel Program

4 steps: Decomposition, Assignment, Orchestration, Mapping
– Done by the programmer or by system software (compiler, runtime, ...)
– Issues are the same, so assume the programmer does it all explicitly

[Figure: the sequential computation is partitioned into tasks (decomposition), the tasks are assigned to processes p0-p3 (assignment), the processes are orchestrated into a parallel program, and the processes are finally mapped onto processors P0-P3 (mapping).]

37

Some Important Concepts

Task:
– Arbitrary piece of undecomposed work in a parallel computation
– Executed sequentially; concurrency is only across tasks
– Fine-grained versus coarse-grained tasks

Process (thread):
– Abstract entity that performs the tasks assigned to it
– Processes communicate and synchronize to perform their tasks

Processor:
– Physical engine on which a process executes
– Processes virtualize the machine to the programmer

• first write the program in terms of processes, then map to processors

Decomposition

Break up computation into tasks to be divided among processes

– Tasks may become available dynamically

– No. of available tasks may vary with time

Identify concurrency and decide level at which to exploit it

Goal: Enough tasks to keep processes busy, but not too many

– No. of tasks available at a time is upper bound on achievable speedup

39

Assignment
Specifying the mechanism to divide work up among processes

– Together with decomposition, also called partitioning

– Balance workload, reduce communication and management cost

Structured approaches usually work well
– Code inspection (parallel loops) or understanding of the application

– Well-known heuristics

– Static versus dynamic assignment

As programmers, we worry about partitioning first
– Usually independent of the architecture or programming model

– But cost and complexity of using primitives may affect decisions

As architects, we assume the program does a reasonable job of it

40

Orchestration

– Naming data
– Structuring communication
– Synchronization
– Organizing data structures and scheduling tasks temporally

Goals
– Reduce cost of communication and synchronization as seen by processors
– Preserve locality of data reference (incl. data structure organization)
– Schedule tasks to satisfy dependences early
– Reduce overhead of parallelism management

Closest to architecture (and programming model & language)
– Choices depend a lot on the communication abstraction and the efficiency of primitives
– Architects should provide appropriate primitives efficiently

41

Orchestration (cont’)

Shared address space
– Shared and private data explicitly separate

– Communication implicit in access patterns

– No correctness need for data distribution

– Synchronization via atomic operations on shared data

– Synchronization explicit and distinct from data communication

Message passing
– Data distribution among local address spaces needed

– No explicit shared structures (implicit in communication patterns)

– Communication is explicit

– Synchronization implicit in communication (at least in the synchronous case)
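A minimal sketch of the message-passing style using MPI point-to-point calls, assuming the program is launched with at least two ranks; the transferred value is arbitrary. Communication is explicit (MPI_Send/MPI_Recv), and the blocking receive also acts as the synchronization.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                           /* data lives in rank 0's address space */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit communication */
    } else if (rank == 1) {
        /* blocking receive: rank 1 cannot proceed until the data arrives,
         * so no separate synchronization is needed */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}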

42

Mapping
After orchestration, we already have a parallel program

Two aspects of mapping:
– Which processes/threads will run on the same processor (core), if necessary

– Which process/thread runs on which particular processor (core)

• mapping to a network topology

One extreme: space-sharing
– Machine divided into subsets; only one app at a time in a subset

– Processes can be pinned to processors, or left to the OS

Another extreme: leave resource management control to the OS
The real world is between the two

– User specifies desires in some aspects, system may ignore

Usually adopt the view: process <-> processor

43

Basic Trade-offs for Performance

44

Trade-offs
Load Balance

– fine-grained tasks
– random or dynamic assignment

Parallelism Overhead
– coarse-grained tasks
– simple assignment

Communication
– decompose to obtain locality
– recompute from local data
– big transfers: amortize overhead and latency
– small transfers: reduce overhead and contention

45

Load Balancing in HPC

Based on notes of James Demmel and David Culler

46

LB in Parallel and Distributed Systems

Load balancing problems differ in:
Task costs
– Do all tasks have equal costs?

– If not, when are the costs known?
• Before starting, when the task is created, or only when the task ends

Task dependencies
– Can all tasks be run in any order (including in parallel)?

– If not, when are the dependencies known?
• Before starting, when the task is created, or only when the task ends

Locality
– Is it important for some tasks to be scheduled on the same processor (or nearby) to reduce communication cost?

– When is the information about communication between tasks known?

47

Task cost spectrum

48

Task Dependency Spectrum

49

Task Locality Spectrum (Data Dependencies)

50

Spectrum of Solutions

One of the key questions is when certain information about the load balancing problem is known

Leads to a spectrum of solutions:

Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (offline algorithms)

Semi-static scheduling. Information may be known at program startup, at the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic.

Dynamic scheduling. Information is not known until mid-execution. (online algorithms)

51

Representative Approaches

Static load balancing
Semi-static load balancing
Self-scheduling
Distributed task queues
Diffusion-based load balancing
DAG scheduling
Mixed parallelism

52

Self-Scheduling
Basic ideas:

– Keep a centralized pool of tasks that are available to run

– When a processor completes its current task, look at the pool

– If the computation of one task generates more, add them to the pool

It is useful when
– There is a batch (or set) of tasks without dependencies

– The cost of each task is unknown

– Locality is not important

– Using a shared memory multiprocessor, so a centralized pool of tasks is fine (How about on a distributed memory system like clusters?)
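On a shared-memory node, the centralized pool can be as simple as a counter protected by a lock, as in the sketch below (the task count, worker count, and trivial "task" are illustrative). On a distributed-memory cluster, the pool would instead live on a master process that hands out tasks over the network, e.g. via MPI messages.

/* Self-scheduling sketch: workers repeatedly claim the next task from a
 * centralized pool until it is empty. Build with: gcc -pthread pool.c */
#include <pthread.h>
#include <stdio.h>

#define NUM_TASKS   100
#define NUM_WORKERS 4

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;                 /* the "pool": tasks 0..NUM_TASKS-1 */

static int get_task(void) {
    int t = -1;
    pthread_mutex_lock(&pool_lock);
    if (next_task < NUM_TASKS)
        t = next_task++;                  /* claim the next available task */
    pthread_mutex_unlock(&pool_lock);
    return t;                             /* -1 means the pool is empty */
}

static void *worker(void *arg) {
    long id = (long)arg;
    int t, done = 0;
    while ((t = get_task()) != -1) {
        /* ... compute task t here; its cost may vary from task to task ... */
        done++;
    }
    printf("worker %ld finished %d tasks\n", id, done);
    return NULL;
}

int main(void) {
    pthread_t workers[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}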

53

Cluster Management

54

Rocks Cluster Distribution: An Example
www.rocksclusters.org
Based on CentOS Linux
Mass installation is a core part of the system

– Mass re-installation for application-specific config.

Front-end central server + compute & storage nodes
Rolls: collections of packages

– Base roll includes: PBS (portable batch system), PVM (parallel virtual machine), MPI (message passing interface), job launchers, …

– Rolls ver 5.1: support for virtual clusters, virtual front ends, virtual compute nodes

55

Microsoft HPC Server 2008: Another Example
Windows Server 2008 + clustering package
Systems management

– Management Console: plug-in to System Center UI with support for Windows PowerShell

– RIS (Remote Installation Service)

Networking
– MS-MPI (Message Passing Interface)

– ICS (Internet Connection Sharing): NAT for cluster nodes
– Network Direct RDMA (Remote DMA)

Job scheduler
Storage: iSCSI SAN and SMB support
Failover support

Microsoft’s Productivity Vision for HPC

Administrator
– Integrated Turnkey HPC Cluster Solution
– Simplified Setup and Deployment
– Built-In Diagnostics
– Efficient Cluster Utilization
– Integrates with IT Infrastructure and Policies

Application Developer
– Integrated Tools for Parallel Programming
– Highly Productive Parallel Programming Frameworks
– Service-Oriented HPC Applications
– Support for Key HPC Development Standards
– Unix Application Migration

End User
– Seamless Integration with Workstation Applications
– Integration with Existing Collaboration and Workflow Solutions
– Secure Job Execution and Data Access

Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users' existing skills and integrating with the tools they are already using.