
1

Cluster Computing

Cheng-Zhong Xu

2

Outline

Cluster Computing Basics
– Multicore architecture

– Cluster Interconnect

Parallel Programming for Performance
MapReduce Programming
Systems Management

3

What’s a Cluster?

Broadly, a group of networked autonomous computers that work together to form a single machine in many respects:

– To improve performance (speed)

– To improve throughput

– To improve service availability (high-availability clusters)

Built from commodity off-the-shelf components, the system is often more cost-effective than a single machine of comparable speed or availability

4

Highly Scalable Clusters

High-Performance Cluster (aka Compute Cluster)
– A form of parallel computer that aims to solve problems faster by using multiple compute nodes

– For parallel efficiency, the nodes are often closely coupled by a high-throughput, low-latency network

Server Cluster and Datacenter
– Aims to improve the system's throughput, service availability, power consumption, etc. by using multiple nodes

5

Top500 Installation of Supercomputers

Top500.com

6

Clusters in Top500

7

An Example of Top500 Submission (F’08)

Location: Tukwila, WA

Hardware – Machines: 256 dual-CPU, quad-core Intel 5320 Clovertown 1.86 GHz, 8 GB RAM each

Hardware – Networking: Private & public: Broadcom GigE; MPI: Cisco InfiniBand SDR, 34 IB switches in a leaf/node configuration

Number of Compute Nodes: 256

Total Number of Cores: 2048

Total Memory: 2 TB of RAM

Particulars of the current Linpack runs

Best Linpack Result: 11.75 TFLOPS

Best Cluster Efficiency: 77.1%

For comparison…

Linpack rating from the June 2007 Top500 run (#106) on the same hardware: 8.99 TFLOPS

Cluster efficiency from the June 2007 Top500 run (#106) on the same hardware: 59%

Typical Top500 efficiency for Clovertown motherboards with IB, regardless of operating system: 65-77% (2 instances of 79%)

30% improvement in efficiency on the same hardware; about one hour to deploy

8

Beowulf Cluster
A cluster of inexpensive PCs for low-cost personal supercomputing
Based on commodity off-the-shelf components:

– PC computers running a Unix-like OS (BSD, Linux, or OpenSolaris)

– Interconnected by an Ethernet LAN

Head node, plus a group of compute nodes
– Head node controls the cluster, and serves files to the compute nodes

Standard, free and open-source software
– Programming in MPI
– MapReduce
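As a concrete illustration of the programming model, here is a minimal MPI program of the kind typically run on a Beowulf cluster. It is a sketch only; the build and launch commands in the comment assume a common MPI distribution such as MPICH or Open MPI.

/* Minimal MPI example. With MPICH or Open MPI, roughly:
 *   mpicc -o hello hello.c
 *   mpirun -np 8 ./hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                    /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */
    MPI_Get_processor_name(name, &name_len);   /* which compute node we are on */

    printf("Hello from rank %d of %d on node %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}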

9

Why Clustering Today

Powerful nodes (CPU, memory, storage)
– Today's PC is yesterday's supercomputer

– Multi-core processors

High-speed networks
– Gigabit Ethernet (56% of the Top500 as of Nov 2008)

– InfiniBand system area network (SAN) (24.6%)

Standard tools for parallel/distributed computing and their growing popularity
– MPI, PBS, etc.

– MapReduce for data-intensive computing

10

Major Issues in Cluster Design
Programmability

– Sequential vs Parallel Programming

– MPI, DSM, DSA: hybrid of multithreading and MPI

– MapReduce

Cluster-aware resource management
– Job scheduling (e.g. PBS)

– Load balancing, data locality, communication optimization, etc.

System management
– Remote installation, monitoring, diagnosis

– Failure management, power management, etc.

11

Cluster Architecture

Multi-core node architecture
Cluster interconnect

12

Single-core computer

13

Single-core CPU chip

[Figure: single-core CPU chip with the single core highlighted.]

14

Multicore Architecture

Combines two or more independent cores (normally CPUs) into a single package

Supports multitasking and multithreading within a single physical package

15

Multicore is Everywhere

Dual-core commonplace in laptops
Quad-core in desktops
Dual quad-core in servers
All major chip manufacturers produce multicore CPUs
– Sun Niagara (8 cores, 64 concurrent threads)
– Intel Xeon (6 cores)
– AMD Opteron (4 cores)

16

Multithreading on multi-core

David Geer, IEEE Computer, 2007

17

Interaction with the OS

OS perceives each core as a separate processor

OS scheduler maps threads/processes to different cores

Most major OSes support multi-core today: Windows, Linux, Mac OS X, …
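To make the OS/core interaction concrete, the sketch below pins a thread to a particular core on Linux using the GNU-specific affinity calls. The chosen core number is illustrative, and in practice the default scheduler placement is usually sufficient.

/* Pin a thread to one of the cores the OS exposes as separate processors.
 * Linux-specific (GNU extensions); build with: gcc -pthread pin.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    printf("worker is running on core %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_attr_t attr;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(1, &set);                   /* core 1, assuming at least two cores exist */

    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);   /* GNU extension */

    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}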

18

Cluster Interconnect

Network fabric connecting the compute nodes
Objective is to strike a balance between

– Processing power of compute nodes

– Communication ability of the interconnect

A more specialized LAN, providing many opportunities for performance optimization

– Switch in the core

– Latency vs bandwidth

[Figure: internal organization of a switch — input ports with receivers, input buffers, a cross-bar, output buffers, and transmitters to the output ports, plus control logic for routing and scheduling.]

19

Goal: Bandwidth and Latency

[Figure: two plots — (a) latency vs. delivered bandwidth and (b) delivered bandwidth vs. offered bandwidth; both curves degrade sharply once the network reaches saturation.]

20

Ethernet Switch: allows multiple simultaneous transmissions

hosts have dedicated, direct connections to the switch

switches buffer packets

Ethernet protocol used on each incoming link, but no collisions; full duplex

– each link is its own collision domain

switching: A-to-A' and B-to-B' simultaneously, without collisions

– not possible with a dumb hub

[Figure: six hosts A, B, C, A', B', C' each attached to one interface of a switch with six interfaces (1-6).]

21

Switch Table

Q: how does the switch know that A' is reachable via interface 4, and B' via interface 5?

A: each switch has a switch table; each entry:

– (MAC address of host, interface to reach host, time stamp)

looks like a routing table!
Q: how are entries created and maintained in the switch table?
– something like a routing protocol?

[Figure: the same six-host, six-interface switch as before.]

22

Switch: self-learning

switch learns which hosts can be reached through which interfaces

– when a frame is received, the switch "learns" the location of the sender: the incoming LAN segment

– records the sender/location pair in the switch table

[Figure: A sends a frame with source A and destination A'; the switch, whose table was initially empty, records the entry (MAC addr A, interface 1, TTL 60).]

23

Self-learning, forwarding: example

[Figure: A sends a frame to A'. The switch records (A, interface 1, TTL 60). The destination A' is unknown, so the frame is flooded on all other interfaces. When A' replies, the location of A is already known, so the reply is sent selectively out interface 1, and the switch adds the entry (A', interface 4, TTL 60).]
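The self-learning and forwarding behavior above can be summarized in a short sketch; the table size, TTL value, and MAC-address strings are illustrative, not taken from any real switch implementation.

/* Sketch of a self-learning switch table: learn on receive, then
 * selectively forward if the destination is known, otherwise flood. */
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 1024
#define ENTRY_TTL  60          /* seconds, as in the example table */
#define FLOOD      (-1)

struct entry {
    char mac[18];              /* MAC address as a string */
    int  iface;                /* interface the host was last seen on */
    int  ttl;                  /* remaining lifetime of the entry */
    int  used;
};

static struct entry table[TABLE_SIZE];

/* Learn: record (sender MAC, incoming interface, TTL) */
static void learn(const char *src_mac, int in_iface) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].used && strcmp(table[i].mac, src_mac) == 0) {
            table[i].iface = in_iface;          /* refresh existing entry */
            table[i].ttl = ENTRY_TTL;
            return;
        }
    }
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].used) {
            strcpy(table[i].mac, src_mac);
            table[i].iface = in_iface;
            table[i].ttl = ENTRY_TTL;
            table[i].used = 1;
            return;
        }
    }
}

/* Forward: return the outgoing interface, or FLOOD if unknown */
static int lookup(const char *dst_mac) {
    for (int i = 0; i < TABLE_SIZE; i++)
        if (table[i].used && strcmp(table[i].mac, dst_mac) == 0)
            return table[i].iface;
    return FLOOD;
}

int main(void) {
    /* Frame from A (interface 1) to A': destination unknown, so flood */
    learn("aa:00:00:00:00:0a", 1);
    printf("to A': %d\n", lookup("aa:00:00:00:00:a1"));   /* -1 => flood */

    /* Reply from A' arrives on interface 4: now both hosts are known */
    learn("aa:00:00:00:00:a1", 4);
    printf("to A : %d\n", lookup("aa:00:00:00:00:0a"));   /* 1 => selective send */
    printf("to A': %d\n", lookup("aa:00:00:00:00:a1"));   /* 4 => selective send */
    return 0;
}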

24

Interconnecting switches

Switches can be connected together

Q: when sending from A to G, how does S1 know to forward a frame destined to G via S4 and S3?

A: self-learning! (works exactly the same as in the single-switch case)

Q: what about latency and bandwidth for a large-scale network?

[Figure: hosts A-I attached to four interconnected switches S1-S4.]

25

What characterizes a network?

Topology (what)
– physical interconnection structure of the network graph
– regular vs irregular

Routing Algorithm (which)
– restricts the set of paths that messages may follow
– table-driven, or routing-algorithm based

Switching Strategy (how)
– how the data in a message traverses a route
– store-and-forward vs cut-through

Flow Control Mechanism (when)
– when a message or portions of it traverse a route
– what happens when traffic is encountered?

The interplay of all of these determines performance

26

Tree: An Example

Diameter and average distance are logarithmic
– k-ary tree, height d = log_k N
– address specified as a d-vector of radix-k coordinates describing the path down from the root

Fixed degree
Route up to the common ancestor and then down

– R = B xor A
– let i be the position of the most significant 1 in R; route up i+1 levels
– go down in the direction given by the low i+1 bits of B

Bandwidth and bisection bandwidth?
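A small sketch of this routing rule for the binary (k = 2) case follows; leaf addresses are d-bit strings giving the path down from the root, and the function name and example addresses are illustrative.

/* Tree routing in a binary tree of height d: go up to the common
 * ancestor, then down following the low bits of the destination. */
#include <stdio.h>

static void tree_route(unsigned a, unsigned b, int d) {
    unsigned r = a ^ b;                      /* R = B xor A */
    if (r == 0) { printf("A == B, no hops\n"); return; }

    /* i = position of the most significant 1 in R (bit 0 = least significant) */
    int i = 0;
    for (int bit = d - 1; bit >= 0; bit--)
        if (r & (1u << bit)) { i = bit; break; }

    printf("up %d level(s) to the common ancestor\n", i + 1);

    /* go down, following bits i, i-1, ..., 0 of B */
    for (int bit = i; bit >= 0; bit--)
        printf("down: %s\n", ((b >> bit) & 1u) ? "right" : "left");
}

int main(void) {
    /* Example: 8-leaf tree (d = 3), route from leaf 010 to leaf 110 */
    tree_route(0x2, 0x6, 3);
    return 0;
}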

27

Bandwidth

Bandwidth
– Point-to-point bandwidth

– Bisectional bandwidth of the interconnect fabric: the rate of data that can be sent across an imaginary line dividing the cluster into two halves, each with an equal number of nodes

For a switch with N ports,
– If it is non-blocking, the bisectional bandwidth = N * the point-to-point bandwidth

– An oversubscribed switch delivers less bisectional bandwidth than a non-blocking one, but is cost-effective. It scales the bandwidth per node up to a point, after which increasing the number of nodes decreases the available bandwidth per node

– Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth
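As a worked example with made-up numbers: an edge switch with 48 host-facing ports and only 4 uplinks of the same speed is oversubscribed 12:1 in the way the ratio is usually quoted, so each host sees roughly 1/12 of its link bandwidth when all hosts send across the bisection at once.

/* Illustrative numbers only: 48 x 1 Gbps host links, 4 x 1 Gbps uplinks. */
#include <stdio.h>

int main(void) {
    double host_ports = 48, uplinks = 4, link_gbps = 1.0;

    double offered   = host_ports * link_gbps;   /* worst case: 48 Gbps toward the core */
    double available = uplinks    * link_gbps;   /* only 4 Gbps of uplink capacity */

    printf("oversubscription = %.0f:1\n", offered / available);        /* 12:1 */
    printf("per-host bandwidth under full load = %.2f Gbps\n",
           available / host_ports);                                    /* ~0.08 Gbps */
    return 0;
}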

28

How to Maintain Constant BW per Node?

Limited ports in a single switch
– Multiple switches

The link between a pair of switches can become a bottleneck
– Fast uplinks

How to organize multiple switches
– Irregular topology

– Regular topologies: ease of management

29

Scalable Interconnect: Examples

[Figure: a 16-node butterfly network built from 2x2 switch building blocks, and a fat tree.]

30

Multidimensional Meshes and Tori

d-dimensional array
– n = k_{d-1} x ... x k_0 nodes
– described by a d-vector of coordinates (i_{d-1}, ..., i_0)

d-dimensional k-ary mesh: N = k^d
– k = N^{1/d}
– described by a d-vector of radix-k coordinates

d-dimensional k-ary torus (or k-ary d-cube)?

[Figure: examples of a 2D mesh, a 2D torus, and a 3D cube.]
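A sketch of the node numbering implied by these definitions: converting a node id to its d-vector of radix-k coordinates and back, plus the wrap-around neighbor that distinguishes a torus from a mesh. The dimension count, radix, and helper names are illustrative.

/* Node numbering in a d-dimensional k-ary mesh/torus (N = k^d). */
#include <stdio.h>

#define D 3        /* dimensions */
#define K 4        /* radix: 4-ary, so N = 4^3 = 64 nodes */

/* node id -> coordinate vector (i_0 varies fastest) */
static void to_coords(int node, int coord[D]) {
    for (int dim = 0; dim < D; dim++) {
        coord[dim] = node % K;
        node /= K;
    }
}

/* coordinate vector -> node id */
static int to_node(const int coord[D]) {
    int node = 0;
    for (int dim = D - 1; dim >= 0; dim--)
        node = node * K + coord[dim];
    return node;
}

/* neighbor one step along dimension `dim`; the modulo gives the torus
 * wrap-around link that a plain mesh does not have */
static int torus_neighbor(int node, int dim, int step) {
    int coord[D];
    to_coords(node, coord);
    coord[dim] = (coord[dim] + step + K) % K;
    return to_node(coord);
}

int main(void) {
    int c[D];
    to_coords(37, c);
    printf("node 37 = (%d, %d, %d)\n", c[2], c[1], c[0]);   /* radix-4 digits of 37 */
    printf("+x neighbor of 37: %d\n", torus_neighbor(37, 0, +1));
    return 0;
}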

31

Packet Switching Strategies
Store and Forward (SF)

– move the entire packet one hop toward the destination
– buffer until the next hop is permitted

Virtual Cut-Through and Wormhole
– pipeline the hops: the switch examines the header, decides where to send the message, and then starts forwarding it immediately

– Virtual Cut-Through: buffer on blockage
– Wormhole: leave the message spread through the network on blockage

32

SF vs WH (VCT) Switching

Unloaded latency: h(n/b + Δ) for store-and-forward vs. n/b + hΔ for cut-through
– h: distance (number of hops)
– n: size of the message
– b: bandwidth
– Δ: additional routing delay per hop

[Figure: time diagrams of store & forward routing vs. cut-through routing — with store-and-forward each switch waits for the entire packet before forwarding it to the next hop, while with cut-through the hops overlap in a pipeline from source to destination.]
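A worked example of the two unloaded-latency formulas above, with made-up values for h, n, b, and the per-hop routing delay:

/* Compare SF and cut-through unloaded latencies for sample parameters. */
#include <stdio.h>

int main(void) {
    double h     = 3;        /* hops between source and destination */
    double n     = 4096;     /* message size in bytes */
    double b     = 1e9;      /* link bandwidth, bytes/s */
    double delta = 1e-6;     /* routing delay per hop, seconds */

    double t_sf = h * (n / b + delta);   /* store-and-forward: whole packet per hop */
    double t_ct = n / b + h * delta;     /* cut-through/wormhole: hops are pipelined */

    printf("store-and-forward: %.2f us\n", t_sf * 1e6);   /* ~15.29 us */
    printf("cut-through:       %.2f us\n", t_ct * 1e6);   /* ~7.10 us  */
    return 0;
}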

33

Conventional Datacenter Network

34

Problems with the Architecture

Resource fragmentation:
– If an application grows and requires more servers, it cannot use available servers in other layer-2 domains, resulting in fragmentation and underutilization of resources

Poor server-to-server connectivity
– Servers in different layer-2 domains must communicate through the layer-3 portion of the network

See the papers in the reading list on Datacenter Network Design for proposed approaches

35

Parallel Programming for Performance

36

Steps in Creating a Parallel Program

4 steps: Decomposition, Assignment, Orchestration, Mapping
– Done by the programmer or by system software (compiler, runtime, ...)
– Issues are the same, so assume the programmer does it all explicitly

[Figure: the sequential computation is partitioned into tasks (decomposition), the tasks are assigned to processes p0-p3 (assignment), the processes are orchestrated into a parallel program, and the processes are finally mapped onto processors P0-P3 (mapping).]

37

Some Important Concepts

Task:
– Arbitrary piece of undecomposed work in a parallel computation
– Executed sequentially; concurrency is only across tasks
– Fine-grained versus coarse-grained tasks

Process (thread):
– Abstract entity that performs the tasks assigned to it
– Processes communicate and synchronize to perform their tasks

Processor:
– Physical engine on which a process executes
– Processes virtualize the machine to the programmer

• first write the program in terms of processes, then map to processors

Decomposition

Break up computation into tasks to be divided among processes

– Tasks may become available dynamically

– No. of available tasks may vary with time

Identify concurrency and decide level at which to exploit it

Goal: Enough tasks to keep processes busy, but not too many

– No. of tasks available at a time is upper bound on achievable speedup

39

Assignment
Specifying the mechanism to divide work up among processes

– Together with decomposition, also called partitioning

– Balance workload, reduce communication and management cost

Structured approaches usually work well
– Code inspection (parallel loops) or understanding of the application

– Well-known heuristics

– Static versus dynamic assignment

As programmers, we worry about partitioning first
– Usually independent of the architecture or programming model

– But cost and complexity of using primitives may affect decisions

As architects, we assume the program does a reasonable job of it

40

Orchestration

– Naming data
– Structuring communication
– Synchronization
– Organizing data structures and scheduling tasks temporally

Goals
– Reduce cost of communication and synchronization as seen by processors
– Preserve locality of data reference (incl. data structure organization)
– Schedule tasks to satisfy dependences early
– Reduce overhead of parallelism management

Closest to architecture (and programming model & language)
– Choices depend a lot on the communication abstraction and the efficiency of primitives
– Architects should provide appropriate primitives efficiently

41

Orchestration (cont’)

Shared address space
– Shared and private data explicitly separate

– Communication implicit in access patterns

– No correctness need for data distribution

– Synchronization via atomic operations on shared data

– Synchronization explicit and distinct from data communication

Message passing
– Data distribution among local address spaces needed

– No explicit shared structures (implicit in communication patterns)

– Communication is explicit

– Synchronization implicit in communication (at least in the synchronous case)
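A minimal sketch of the message-passing style using MPI point-to-point calls, assuming the program is launched with at least two ranks; the transferred value is arbitrary. Communication is explicit (MPI_Send/MPI_Recv), and the blocking receive also acts as the synchronization.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                           /* data lives in rank 0's address space */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit communication */
    } else if (rank == 1) {
        /* blocking receive: rank 1 cannot proceed until the data arrives,
         * so no separate synchronization is needed */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}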

42

Mapping
After orchestration, we already have a parallel program

Two aspects of mapping:
– Which processes/threads will run on the same processor (core), if necessary

– Which process/thread runs on which particular processor (core)

• mapping to a network topology

One extreme: space-sharing
– Machine divided into subsets; only one app at a time in a subset

– Processes can be pinned to processors, or left to the OS

Another extreme: leave resource management control to the OS
The real world is between the two

– User specifies desires in some aspects, system may ignore

Usually adopt the view: process <-> processor

43

Basic Trade-offs for Performance

44

Trade-offs
Load Balance

– fine-grained tasks
– random or dynamic assignment

Parallelism Overhead
– coarse-grained tasks
– simple assignment

Communication
– decompose to obtain locality
– recompute from local data
– big transfers: amortize overhead and latency
– small transfers: reduce overhead and contention

45

Load Balancing in HPC

Based on notes of James Demmel and David Culler

46

LB in Parallel and Distributed Systems

Load balancing problems differ in:
Task costs
– Do all tasks have equal costs?

– If not, when are the costs known?
• Before starting, when the task is created, or only when the task ends

Task dependencies
– Can all tasks be run in any order (including in parallel)?

– If not, when are the dependencies known?
• Before starting, when the task is created, or only when the task ends

Locality
– Is it important for some tasks to be scheduled on the same processor (or nearby) to reduce communication cost?

– When is the information about communication between tasks known?

47

Task cost spectrum

48

Task Dependency Spectrum

49

Task Locality Spectrum (Data Dependencies)

50

Spectrum of Solutions

One of the key questions is when certain information about the load balancing problem is known

Leads to a spectrum of solutions:

Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (offline algorithms)

Semi-static scheduling. Information may be known at program startup, at the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic.

Dynamic scheduling. Information is not known until mid-execution. (online algorithms)

51

Representative Approaches

Static load balancing
Semi-static load balancing
Self-scheduling
Distributed task queues
Diffusion-based load balancing
DAG scheduling
Mixed parallelism

52

Self-Scheduling
Basic ideas:

– Keep a centralized pool of tasks that are available to run

– When a processor completes its current task, look at the pool

– If the computation of one task generates more, add them to the pool

It is useful when
– There is a batch (or set) of tasks without dependencies

– The cost of each task is unknown

– Locality is not important

– Using a shared memory multiprocessor, so a centralized pool of tasks is fine (How about on a distributed memory system like clusters?)
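On a shared-memory node, the centralized pool can be as simple as a counter protected by a lock, as in the sketch below (the task count, worker count, and trivial "task" are illustrative). On a distributed-memory cluster, the pool would instead live on a master process that hands out tasks over the network, e.g. via MPI messages.

/* Self-scheduling sketch: workers repeatedly claim the next task from a
 * centralized pool until it is empty. Build with: gcc -pthread pool.c */
#include <pthread.h>
#include <stdio.h>

#define NUM_TASKS   100
#define NUM_WORKERS 4

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;                 /* the "pool": tasks 0..NUM_TASKS-1 */

static int get_task(void) {
    int t = -1;
    pthread_mutex_lock(&pool_lock);
    if (next_task < NUM_TASKS)
        t = next_task++;                  /* claim the next available task */
    pthread_mutex_unlock(&pool_lock);
    return t;                             /* -1 means the pool is empty */
}

static void *worker(void *arg) {
    long id = (long)arg;
    int t, done = 0;
    while ((t = get_task()) != -1) {
        /* ... compute task t here; its cost may vary from task to task ... */
        done++;
    }
    printf("worker %ld finished %d tasks\n", id, done);
    return NULL;
}

int main(void) {
    pthread_t workers[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}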

53

Cluster Management

54

Rocks Cluster Distribution: An Example
www.rocksclusters.org
Based on CentOS Linux
Mass installation is a core part of the system

– Mass re-installation for application-specific config.

Front-end central server + compute & storage nodes
Rolls: collections of packages

– Base roll includes: PBS (portable batch system), PVM (parallel virtual machine), MPI (message passing interface), job launchers, …

– Rolls ver 5.1: support for virtual clusters, virtual front ends, virtual compute nodes

55

Microsoft HPC Server 2008: Another Example
Windows Server 2008 + clustering package
Systems management

– Management Console: plug-in to System Center UI with support for Windows PowerShell

– RIS (Remote Installation Service)

Networking
– MS-MPI (Message Passing Interface)

– ICS (Internet Connection Sharing): NAT for cluster nodes
– Network Direct RDMA (Remote DMA)

Job scheduler
Storage: iSCSI SAN and SMB support
Failover support

Microsoft’s Productivity Vision for HPC

Administrator
– Integrated Turnkey HPC Cluster Solution
– Simplified Setup and Deployment
– Built-In Diagnostics
– Efficient Cluster Utilization
– Integrates with IT Infrastructure and Policies

Application Developer
– Integrated Tools for Parallel Programming
– Highly Productive Parallel Programming Frameworks
– Service-Oriented HPC Applications
– Support for Key HPC Development Standards
– Unix Application Migration

End User
– Seamless Integration with Workstation Applications
– Integration with Existing Collaboration and Workflow Solutions
– Secure Job Execution and Data Access

Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users' existing skills and integrating with the tools they are already using.