1
Rajkumar Buyya, Monash University, Melbourne.
Email: [email protected] / [email protected]
Web: http://www.ccse.monash.edu.au/~rajkumar / www.buyya.com
High Performance Cluster Computing (Architecture, Systems, and Applications)
ISCA2000
2
Objectives
Learn and Share Recent advances in cluster computing (both in research and commercial settings):
– Architecture
– System Software
– Programming Environments and Tools
– Applications
Cluster Computing Infoware: (tutorial online)
– http://www.buyya.com/cluster/
3
Agenda
Overview of Computing
Motivations & Enabling Technologies
Cluster Architecture & its Components
Clusters Classifications
Cluster Middleware
Single System Image
Representative Cluster Systems
Resources and Conclusions
4
Computing Elements

[Figure: a multi-processor computing system layered from hardware, through a microkernel operating system exposing a threads interface (processes, processors, threads), up to applications and their programming paradigms.]
5
Two Eras of Computing

[Figure: the sequential and parallel eras on a 1940-2030 timeline, each era passing through architectures, system software, applications, and problem-solving environments (PSEs) in overlapping phases of R&D, commercialization, and commodity.]
6
Computing Power andComputer Architectures
7
Computing Power (HPC) Drivers
Solving grand challenge applications using computer modeling, simulation and analysis
Life Sciences
CAD/CAM
Aerospace
Digital Biology
Military Applications
E-commerce/anything
8
How to Run Applications Faster?
There are 3 ways to improve performance:
– 1. Work Harder
– 2. Work Smarter
– 3. Get Help
Computer Analogy
–1. Use faster hardware: e.g. reduce the time per instruction (clock cycle).
–2. Optimized algorithms and techniques
–3. Use multiple computers to solve the problem: that is, increase the number of instructions executed per unit time.
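The third option - getting help from multiple computers - can be sketched in miniature with Python's multiprocessing module standing in for cluster nodes (an illustration, not part of the original tutorial; the function names are invented for the example):

```python
from multiprocessing import Pool

def partial_sum(bounds):
    """Sum of squares over the half-open range [lo, hi) -- one node's share."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_sum_of_squares(n, workers=4):
    """Split [0, n) into one chunk per worker and combine the partial sums."""
    step = (n + workers - 1) // workers
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    assert parallel_sum_of_squares(10_000) == sum(i * i for i in range(10_000))
```

On a real cluster the chunks would be shipped to other machines (e.g. via MPI or PVM), but the split-compute-combine pattern is the same.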
9
Computing Platforms Evolution - Breaking Administrative Barriers

[Figure: performance grows as administrative barriers are crossed - from a single-processor desktop, to SMPs or supercomputers, to a local cluster, an enterprise cluster/grid, a global cluster/grid, and perhaps an inter-planet cluster/grid - across individual, group, department, campus, state, national, globe, inter-planet, and universe scales.]
10
Application Case Study
Web Serving and E-Commerce
11
E-Commerce and PDC ?
What are/will be the major problems/issues in eCommerce? How will or can PDC be applied to solve some of them?
Other than “Compute Power”, what else can PDC contribute to e-commerce?
How would/could the different forms of PDC (clusters, hypercluster, GRID,…) be applied to e-commerce?
Could you describe one hot research topic for PDC applying to e-commerce?
A killer e-commerce application for PDC ? …...
12
Killer Applications of Clusters
Numerous Scientific & Engineering Applications
Parametric Simulations
Business Applications
– E-commerce Applications (Amazon.com, eBay.com, …)
– Database Applications (Oracle on cluster)
– Decision Support Systems
Internet Applications
– Web serving / searching
– Infowares (yahoo.com, AOL.com)
– ASPs (application service providers)
– eMail, eChat, ePhone, eBook, eCommerce, eBank, eSociety, eAnything!
– Computing Portals
Mission Critical Applications
– command and control systems, banks, nuclear reactor control, star-wars, and handling life-threatening situations.
13
Major problems/issues in E-commerce
Social Issues
Capacity Planning
Multilevel Business Support (e.g., B2P2C)
Information Storage, Retrieval, and Update
Performance
Heterogeneity
System Scalability
System Reliability
Identification and Authentication
System Expandability
Security
Cyber Attack Detection and Control (cyberguard)
Data Replication, Consistency, and Caching
Manageability (administration and control)
14
Amazon.com: Online Sales/Trading - a Killer E-commerce Portal

Several thousands of items
– books, publishers, suppliers
Millions of customers
– customer details, transaction details, support for transaction updates
(Millions) of partners
– keeping track of partner details, tracking referral links to partners, sales, and payments
Sales based on advertised price
Sales through auctions/bids
– a mechanism for participating in the bid (buyers/sellers define the rules of the game)
15
Can these driveE-Commerce ?
Clusters are already in use for web serving, web hosting, and a number of other Internet applications, including e-commerce:
– scalability, availability, performance, reliable-high performance-massive storage and database support.
– Attempts to support online detection of cyber attacks (through data mining) and control
Hyperclusters and the GRID:
– Support for transparency in (secure) Site/Data Replication for high availability and quick response time (taking site close to the user).
– Compute power from hyperclusters/Grid can be used for data mining for cyber attacks and fraud detection and control.
– Helps to build Compute Power Market, ASPs, and Computing Portals.
16
Science Portals - e.g., PAPIA system
PAPIA PC Cluster
Pentiums, Myrinet, NetBSD/Linux, PM, SCore-D, MPC++
RWCP Japan: http://www.rwcp.or.jp/papia/
17
PDC hot topics for E-commerce
Cluster-based web servers, search engines, portals, …
Scheduling and Single System Image
Heterogeneous Computing
Reliability, High Availability, and Data Recovery
Parallel databases and high-performance, reliable mass-storage systems
CyberGuard! Data mining for detection of cyber attacks, frauds, etc.; detection and online control
Data mining for identifying sales patterns and automatically tuning the portal to special sessions/festival sales
eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment, eTravel, eGoods, and so on
Data/site replication and caching techniques
Compute Power Market
Infowares (yahoo.com, AOL.com)
ASPs (application service providers)
. . .
18
Sequential Architecture Limitations
Sequential architectures are reaching physical limitations (speed of light, thermodynamics).

Hardware improvements like pipelining, superscalar execution, etc., are non-scalable and require sophisticated compiler technology.

Vector processing works well for certain kinds of problems.
19
Computational Power Improvement

[Figure: computational power improvement (C.P.I.) versus number of processors - the multiprocessor curve climbs with processor count while the uniprocessor line stays flat.]
20
Human Physical Growth Analogy: Computational Power Improvement

[Figure: growth versus age (5 to 45+), contrasting vertical growth with horizontal growth.]
21
Why Parallel Processing NOW?

The technology of parallel processing is mature and can be exploited commercially; there is significant R&D work on the development of tools and environments.

Significant developments in networking technology are paving the way for heterogeneous computing.
22
History of Parallel Processing
Parallel processing can be traced back to a tablet dated around 100 BC, which had three calculating positions. One may infer that the multiple positions served reliability and/or speed.
23
Motivating Factors

The aggregate speed with which complex calculations are carried out by the millions of neurons in the human brain is amazing, even though an individual neuron's response is slow (milliseconds) - demonstrating the feasibility of parallel processing.
24
Simple classification by Flynn: (No. of instruction and data streams)
SISD - conventional
SIMD - data parallel, vector computing
MISD - systolic arrays
MIMD - very general, multiple approaches
Current focus is on MIMD model, using general purpose processors or multicomputers.
Taxonomy of Architectures
25
Main HPC Architectures..1a
SISD - mainframes, workstations, PCs
SIMD Shared Memory - vector machines, Cray...
MIMD Shared Memory - Sequent, KSR, Tera, SGI, SUN
SIMD Distributed Memory - DAP, TMC CM-2...
MIMD Distributed Memory - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation clusters (IBM SP2, DEC, Sun, HP)
26
Motivation for using Clusters
The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.
Workstation clusters are easier to integrate into existing networks than special parallel computers.
27
Main HPC Architectures..1b.
NOTE: Modern sequential machines are not purely SISD - advanced RISC processors use many concepts from
– vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc) in order to achieve one or more arithmetic operations per clock cycle.
28
Parallel Processing Paradox
Time required to develop a parallel application for solving GCA is equal to:
– Half Life of Parallel Supercomputers.
29
The Need for Alternative
Supercomputing Resources
Vast numbers of underutilised workstations are available to use.

Huge numbers of unused processor cycles and resources could be put to good use in a wide variety of application areas.

Reluctance to buy supercomputers, due to their cost and short life span.
Distributed compute resources “fit” better into today's funding model.
30
Technology Trend
31
Scalable Parallel Computers
32
Design Space of Competing Computer Architectures
33
Towards Inexpensive Supercomputing
It is:
Cluster Computing..The Commodity Supercomputing!
34
Cluster Computing - Research Projects
Beowulf (CalTech and NASA) - USA
CCS (Computing Centre Software) - Paderborn, Germany
Condor - University of Wisconsin, USA
DQS (Distributed Queuing System) - Florida State University, USA
EASY - Argonne National Lab, USA
HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA
far - University of Liverpool, UK
Gardens - Queensland University of Technology, Australia
MOSIX - Hebrew University of Jerusalem, Israel
MPI (MPI Forum; MPICH is one of the popular implementations)
NOW (Network of Workstations) - Berkeley, USA
NIMROD - Monash University, Australia
NetSolve - University of Tennessee, USA
PBS (Portable Batch System) - NASA Ames and LLNL, USA
PVM - Oak Ridge National Lab / UTK / Emory, USA
35
Cluster Computing - Commercial Software
Codine (Computing in Distributed Network Environment) - GENIAS GmbH, Germany
LoadLeveler - IBM Corp., USA
LSF (Load Sharing Facility) - Platform Computing, Canada
NQE (Network Queuing Environment) - Craysoft Corp., USA
OpenFrame - Centre for Development of Advanced Computing, India
RWPC (Real World Computing Partnership), Japan
UnixWare (SCO - Santa Cruz Operation), USA
Solaris MC (Sun Microsystems), USA
ClusterTools (a number of free HPC cluster tools from Sun)
A number of commercial vendors worldwide offer clustering solutions, including IBM, Compaq, Microsoft, and startups like TurboLinux, HPTI, Scali, BlackStone, …
36
Motivation for using Clusters
Surveys show utilisation of CPU cycles of desktop workstations is typically <10%.
Performance of workstations and PCs is rapidly improving
As performance grows, percent utilisation will decrease even further!
Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.
37
Motivation for using Clusters
The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.
Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.
Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!
38
Cycle Stealing
Usually a workstation will be owned by an individual, group, department, or organisation - it is dedicated to the exclusive use of its owners.
This brings problems when attempting to form a cluster of workstations for running distributed applications.
39
Cycle Stealing
Typically, there are three types of owners, who use their workstations mostly for:
1. Sending and receiving email and preparing documents.
2. Software development - edit, compile, debug and test cycle.
3. Running compute-intensive applications.
40
Cycle Stealing
Cluster computing aims to steal spare cycles from (1) and (2) to provide resources for (3).
However, this requires overcoming the ownership hurdle - people are very protective of their workstations.
Usually requires organisational mandate that computers are to be used in this way.
Stealing cycles outside standard work hours (e.g. overnight) is easy, stealing idle cycles during work hours without impacting interactive use (both CPU and memory) is much harder.
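The guard a cycle-stealing daemon might apply can be sketched as follows. The 0.5 load-average threshold and the `run_if_idle` helper are illustrative assumptions, not from any real system (Condor, for instance, also watches keyboard activity and memory pressure), and `os.getloadavg` is Unix-only:

```python
import os

def node_is_idle(threshold=0.5):
    """True when the 1-minute load average suggests spare cycles.

    The 0.5 threshold is an illustrative choice; real cycle-stealing
    systems also watch keyboard/mouse activity and memory use before
    declaring a workstation idle.
    """
    one_minute, _, _ = os.getloadavg()  # Unix-only
    return one_minute < threshold

def run_if_idle(job, threshold=0.5):
    """Run a guest job only when the workstation looks idle; else back off."""
    if node_is_idle(threshold):
        return job()
    return None  # the owner is (probably) using the machine
```

A daemon would poll this periodically and also checkpoint or migrate the guest job the moment the owner returns.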
41
Rise & Fall of Computing
Technologies
[Figure: waves of computing technologies - mainframes giving way to minis and PCs (circa 1970-1980), and minis and PCs giving way to network computing (circa 1995).]
42
Original Food Chain Picture
43
1984 Computer Food Chain

[Figure: PC, workstation, minicomputer, mainframe, vector supercomputer.]
44
1994 Computer Food Chain

[Figure: PC, workstation, mainframe, vector supercomputer, MPP - with the minicomputer hitting the wall soon and its future bleak.]
45
Computer Food Chain (Now and Future)
46
What is a cluster?
A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.
A typical cluster:
– Network: faster, closer connection than a typical LAN
– Low-latency communication protocols
– Looser connection than SMP
47
Why Clusters Now? (Beyond Technology and Cost)
Building block is big enough
– complete computers (HW & SW) shipped in millions: killer micro, killer RAM, killer disks,killer OS, killer networks, killer apps.
Workstations performance is doubling every 18 months.
Networks are faster
– higher link bandwidth (vs. 10 Mbit Ethernet)
– switch-based networks coming (ATM)
– interfaces simple & fast (Active Messages)
Striped files preferred (RAID)
Demise of mainframes, supercomputers, & MPPs
48
Architectural Drivers…(cont)
Node architecture dominates performance
– processor, cache, bus, and memory– design and engineering $ => performance
Greatest demand for performance is on large systems
– must track the leading edge of technology without lag
MPP network technology => mainstream
– system area networks
System on every node is a powerful enabler
– very high speed I/O, virtual memory, scheduling, …
49
...Architectural Drivers
Clusters can be grown: Incremental scalability (up, down, and across)
– Individual nodes performance can be improved by adding additional resource (new memory blocks/disks)
– New nodes can be added or nodes can be removed– Clusters of Clusters and Metacomputing
Complete software tools
– Threads, PVM, MPI, DSM, C, C++, Java, Parallel C++, Compilers, Debuggers, OS, etc.
Wide class of applications
– Sequential and grand challenge parallel applications
Clustering of Computers for Collective Computing:
Trends
[Figure: timeline - 1960, 1990, 1995+, 2000 and beyond.]
51
Example Clusters:Berkeley NOW
100 Sun UltraSparcs
– 200 disks
Myrinet SAN
– 160 MB/s
Fast communication
– AM, MPI, ...
Ether/ATM switched external net
Global OS
Self Config
52
Basic Components
[Figure: a NOW node - a Sun Ultra 170 with processor, cache, and memory on an I/O bus, attached through a Myricom NIC to the 160 MB/s Myrinet.]
53
Massive Cheap Storage Cluster
Basic unit:
2 PCs double-ending four SCSI chains of 8 disks each
Currently serving Fine Art at http://www.thinker.org/imagebase/
54
Cluster of SMPs (CLUMPS)
Four Sun E5000s
– 8 processors each
– 4 Myricom NICs each
Multiprocessor, Multi-NIC, Multi-Protocol
NPACI => Sun 450s
55
Millennium PC Clumps
Inexpensive, easy to manage Cluster
Replicated in many departments
Prototype for very large PC cluster
56
Adoption of the Approach
57
So What’s So Different?
Commodity parts?
Communications packaging?
Incremental scalability?
Independent failure?
Intelligent network interfaces?
Complete system on every node
– virtual memory
– scheduler
– files
– ...
58
OPPORTUNITIES &
CHALLENGES
59
Opportunity of Large-scale Computing on NOW

Shared pool of computing resources: processors, memory, disks, interconnect

Guarantee at least one workstation to many individuals (when active)

Deliver a large % of the collective resources to a few individuals at any one time
60
Windows of Opportunities
MPP/DSM:
– compute across multiple systems: parallel
Network RAM:
– idle memory in other nodes; page across other nodes' idle memory
Software RAID:
– file system supporting parallel I/O and reliability, mass storage
Multi-path Communication:
– communicate across multiple networks: Ethernet, ATM, Myrinet
61
Parallel Processing
Scalable parallel applications require:
– good floating-point performance
– low-overhead communication
– scalable network bandwidth
– a parallel file system
62
Network RAM
Performance gap between processor and disk has widened.
Thrashing to disk degrades performance significantly
Paging across networks can be effective with high performance networks and OS that recognizes idle machines
Typically thrashing to network RAM can be 5 to 10 times faster than thrashing to disk
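Back-of-envelope arithmetic supports this: the cost of moving one page is roughly latency plus size over bandwidth. The numbers below are assumed, era-typical values, not measurements; the raw hardware model actually favors the network by more than the observed 5-10x, the gap being protocol and OS overhead:

```python
def page_transfer_time(latency_s, bandwidth_bytes_per_s, page_bytes=4096):
    """Back-of-envelope cost of moving one page: latency + size / bandwidth."""
    return latency_s + page_bytes / bandwidth_bytes_per_s

# Assumed, era-typical figures (illustrative, not measured):
# local disk: ~10 ms positioning, ~5 MB/s; SAN paging: ~50 us, ~80 MB/s.
disk_page = page_transfer_time(10e-3, 5e6)
net_page = page_transfer_time(50e-6, 80e6)

# The raw model gives a large per-page advantage; the end-to-end 5-10x
# quoted above includes protocol and OS overhead on top of this.
assert disk_page / net_page > 10
```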
63
Software RAID: Redundant Array of Workstation Disks

I/O Bottleneck:
– Microprocessor performance is improving more than 50% per year.
– Disk access improvement is < 10% per year.
– Applications often perform I/O.
RAID cost per byte is high compared to single disks.
RAIDs are connected to host computers, which are often a performance and availability bottleneck.
RAID in software - striping data across an array of workstation disks - provides performance, and some degree of redundancy provides availability.
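Striping with an XOR parity stripe (RAID-4 style) can be sketched as below; the function names are illustrative, not from any real software RAID package:

```python
def stripe_with_parity(data: bytes, n_disks: int):
    """Split data into n_disks-1 equal stripes plus one XOR parity stripe,
    as a RAID-4-style software layer across workstation disks would."""
    n_data = n_disks - 1
    size = (len(data) + n_data - 1) // n_data  # ceiling division
    stripes = [data[i * size:(i + 1) * size].ljust(size, b"\0")
               for i in range(n_data)]
    parity = bytearray(size)
    for stripe in stripes:
        for i, byte in enumerate(stripe):
            parity[i] ^= byte
    return stripes, bytes(parity)

def recover_stripe(stripes, parity, lost):
    """Rebuild the lost stripe by XOR-ing parity with the survivors."""
    rebuilt = bytearray(parity)
    for j, stripe in enumerate(stripes):
        if j != lost:
            for i, byte in enumerate(stripe):
                rebuilt[i] ^= byte
    return bytes(rebuilt)

stripes, parity = stripe_with_parity(b"cluster file system block", 4)
assert recover_stripe(stripes, parity, 1) == stripes[1]  # one disk failed
```

Reads hit all data disks in parallel (performance); any single failed disk can be rebuilt from the rest (availability).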
64
Software RAID, Parallel File Systems, and
Parallel I/O
65
Cluster Computer and its Components
66
Clustering Today
Clustering gained momentum when 3 technologies converged:
– 1. Very high-performance microprocessors
• workstation performance = yesterday's supercomputers
– 2. High-speed communication
• communication between cluster nodes >= between processors in an SMP
– 3. Standard tools for parallel/distributed computing and their growing popularity
67
Cluster Computer Architecture
68
Cluster Components...1a
Nodes
Multiple high-performance components:
– PCs
– Workstations
– SMPs (CLUMPS)
– Distributed HPC systems leading to metacomputing
They can be based on different architectures and run different OSes.
69
Cluster Components...1b
Processors

There are many (CISC/RISC/VLIW/Vector…):
– Intel: Pentiums, Xeon, Merced…
– Sun: SPARC, UltraSPARC
– HP PA
– IBM RS6000/PowerPC
– SGI MIPS
– Digital Alphas
Integration of memory, processing, and networking into a single chip:
– IRAM (CPU & Mem): http://iram.cs.berkeley.edu
– Alpha 21366 (CPU, memory controller, NI)
70
Cluster Components…2: Operating Systems
State of the art OS:– Linux (Beowulf)
– Microsoft NT (Illinois HPVM)
– SUN Solaris (Berkeley NOW)
– IBM AIX (IBM SP2)
– HP UX (Illinois - PANDA)
– Mach (Microkernel based OS) (CMU)
– Cluster operating systems (Solaris MC, SCO UnixWare, MOSIX (an academic project))
– OS gluing layers: (Berkeley Glunix)
71
Cluster Components…3: High Performance Networks

Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps)
SCI (Dolphin - MPI - 12-microsecond latency)
ATM
Myrinet (1.2 Gbps)
Digital Memory Channel
FDDI
72
Cluster Components…4: Network Interfaces
Network Interface Card
– Myrinet has NIC
– User-level access support
– Alpha 21364 processor integrates processing, memory controller, network interface into a single chip..
73
Cluster Components…5: Communication Software

Traditional OS-supported facilities (heavyweight due to protocol processing):
– Sockets (TCP/IP), pipes, etc.
Lightweight protocols (user level):
– Active Messages (Berkeley)
– Fast Messages (Illinois)
– U-net (Cornell)
– XTP (Virginia)
Systems can be built on top of the above protocols.
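The heavyweight path - stream sockets with a small framing layer on top - looks roughly like this. `send_msg`/`recv_msg` are illustrative helpers, and `socketpair` stands in for a TCP connection between two cluster nodes:

```python
import socket

def send_msg(sock, payload: bytes):
    """Length-prefixed send: a minimal framing layer over a stream socket."""
    sock.sendall(len(payload).to_bytes(4, "big") + payload)

def _recv_exact(sock, n):
    """Read exactly n bytes, looping because recv may return short."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def recv_msg(sock) -> bytes:
    """Read the 4-byte length header, then that many payload bytes."""
    length = int.from_bytes(_recv_exact(sock, 4), "big")
    return _recv_exact(sock, length)

# Two endpoints in one process stand in for two cluster nodes.
node_a, node_b = socket.socketpair()
send_msg(node_a, b"task: integrate f over [0, 1]")
assert recv_msg(node_b) == b"task: integrate f over [0, 1]"
```

The protocol processing buried in this path (TCP checksums, copies, kernel crossings) is precisely what user-level protocols such as Active Messages and Fast Messages strip away.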
74
Cluster Components…6a
Cluster Middleware
Resides between the OS and applications and offers an infrastructure for supporting:
– Single System Image (SSI)
– System Availability (SA)
SSI makes the collection appear as a single machine (globalised view of system resources), e.g. telnet cluster.myinstitute.edu
SA - checkpointing and process migration…
75
Cluster Components…6b
Middleware Components
Hardware
– DEC Memory Channel, DSM (Alewife, DASH), SMP techniques
OS / Gluing Layers
– Solaris MC, Unixware, GLUnix
Applications and Subsystems– System management and electronic forms– Runtime systems (software DSM, PFS etc.)– Resource management and scheduling (RMS):
• CODINE, LSF, PBS, NQS, etc.
76
Cluster Components…7a: Programming Environments

Threads (PCs, SMPs, NOW…)
– POSIX Threads
– Java Threads
MPI
– Linux, NT, on many supercomputers
PVM
Software DSMs (Shmem)
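A minimal thread-based work-sharing pattern - here with Python threads standing in for POSIX threads - might look like the following sketch (the worker and its task queue are invented for the example):

```python
import queue
import threading

def worker(tasks, results, lock):
    """Pull tasks until the queue drains; publish results under a lock."""
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return
        value = item * item  # stand-in for real computation
        with lock:
            results.append(value)

tasks = queue.Queue()
for i in range(100):
    tasks.put(i)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(tasks, results, lock))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(results) == [i * i for i in range(100)]
```

The same queue-of-work structure recurs in MPI and PVM programs, with message passing replacing the shared queue.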
77
Cluster Components…7b
Development Tools?

Compilers
– C/C++/Java
– Parallel programming with C++ (MIT Press book)
RAD (rapid application development) tools: GUI-based tools for PP modeling
Debuggers
Performance Analysis Tools
Visualization Tools
78
Cluster Components…8: Applications

Sequential
Parallel / Distributed (cluster-aware applications)
– Grand Challenging applications• Weather Forecasting
• Quantum Chemistry
• Molecular Biology Modeling
• Engineering Analysis (CAD/CAM)
• ……………….
– PDBs, web servers, data mining
79
Key Operational Benefits of Clustering
System availability (HA): clusters offer inherently high availability due to the redundancy of hardware, operating systems, and applications.
Hardware fault tolerance: redundancy for most system components (e.g. disk RAID), covering both hardware and software.
OS and application reliability: run multiple copies of the OS and applications, and benefit from this redundancy.
Scalability: add servers to the cluster, add more clusters to the network, or add CPUs to an SMP as the need arises.
High performance: running cluster-enabled programs.
80
Classification
of Cluster Computer
81
Clusters Classification..1
Based on Focus (in Market)
– High Performance (HP) Clusters
• Grand Challenge Applications
– High Availability (HA) Clusters
• Mission-critical applications
82
HA Cluster: Server Cluster with "Heartbeat" Connection
83
Clusters Classification..2
Based on Workstation/PC Ownership
– Dedicated Clusters
– Non-dedicated Clusters
• Adaptive parallel computing
• Also called communal multiprocessing
84
Clusters Classification..3
Based on Node Architecture..
– Clusters of PCs (CoPs)
– Clusters of Workstations (COWs)
– Clusters of SMPs (CLUMPs)
85
Building Scalable Systems: Cluster of SMPs (Clumps)
Performance of SMP Systems Vs. Four-Processor Servers in a Cluster
86
Clusters Classification..4
Based on Node OS Type..
– Linux Clusters (Beowulf)
– Solaris Clusters (Berkeley NOW)
– NT Clusters (HPVM)
– AIX Clusters (IBM SP2)
– SCO/Compaq Clusters (Unixware)
– Digital VMS Clusters, HP Clusters, …
87
Clusters Classification..5
Based on node components architecture & configuration (Processor Arch, Node Type: PC/Workstation.. & OS: Linux/NT..):
– Homogeneous Clusters
• All nodes have a similar configuration.
– Heterogeneous Clusters
• Nodes based on different processors and running different OSes.
88
Clusters Classification..6a
Dimensions of Scalability & Levels of Clustering

[Figure: three dimensions of scalability - (1) platform: uniprocessor, SMP, cluster, MPP; (2) CPU / I/O / memory / OS; (3) network technology - with clustering levels ranging over workgroup, department, enterprise, campus, and public metacomputing (GRID).]
89
Clusters Classification..6b
Levels of Clustering

Group Clusters (#nodes: 2-99)
– a set of dedicated/non-dedicated computers, mainly connected by a SAN like Myrinet
Departmental Clusters (#nodes: 99-999)
Organizational Clusters (#nodes: many 100s), using e.g. ATM networks
Internet-wide Clusters = Global Clusters (#nodes: 1000s to many millions)
– Metacomputing
– Web-based Computing
– Agent-Based Computing
• Java plays a major role in web- and agent-based computing
90
Major Issues in Cluster Design

Size scalability (physical & application)
Enhanced availability (failure management)
Single system image (look-and-feel of one system)
Fast communication (networks & protocols)
Load balancing (CPU, net, memory, disk)
Security and encryption (clusters of clusters)
Distributed environment (social issues)
Manageability (administration and control)
Programmability (simple API if required)
Applicability (cluster-aware and non-aware applications)
91
Cluster Middleware
and
Single System Image
92
A typical Cluster Computing Environment
[Figure: a software stack - Application on top; PVM / MPI / RSH beneath; a missing layer ("???") below that; Hardware/OS at the bottom.]
93
CC should support
Multi-user, time-sharing environments
Nodes with different CPU speeds and memory sizes
(heterogeneous configuration)
Many processes, with unpredictable requirements
Unlike SMP: insufficient “bonds” between nodes
– Each computer operates independently
– Inefficient utilization of resources
94
The missing link is provided by cluster middleware/underware
[Figure: the stack with the gap filled - Application; PVM / MPI / RSH; Middleware or Underware; Hardware/OS.]
95
SSI Clusters--SMP services on a CC
Adaptive resource usage for better performance
Ease of use - almost like SMP
Scalable configurations - by decentralized control
Result: HPC/HAC at PC/Workstation prices
“Pool Together” the “Cluster-Wide” resources
96
What is Cluster Middleware ?
An interface between user applications and the cluster hardware and OS platform.
Middleware packages support each other at the management, programming, and implementation levels.
Middleware Layers:
– SSI Layer
– Availability Layer, which enables the cluster services of:
• checkpointing, automatic failover, recovery from failure,
• fault-tolerant operation among all cluster nodes
97
Middleware Design Goals
Complete Transparency (Manageability)
– lets the user see a single cluster system
• single entry point, ftp, telnet, software loading…
Scalable Performance
– easy growth of the cluster
• no change of API & automatic load distribution
Enhanced Availability
– automatic recovery from failures
• employ checkpointing & fault-tolerance technologies
– handle consistency of data when replicated
98
What is Single System Image (SSI) ?
A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
SSI makes the cluster appear like a single machine to the user, to applications, and to the network.
A cluster without an SSI is not a cluster.
99
Benefits of Single System Image
Use of system resources transparently
Transparent process migration and load balancing across nodes
Improved reliability and higher availability
Improved system response time and performance
Simplified system management
Reduction in the risk of operator errors
Users need not be aware of the underlying system architecture to use the machines effectively
100
Desired SSI Services
Single Entry Point
– telnet cluster.my_institute.edu
– telnet node1.cluster.institute.edu
Single File Hierarchy: xFS, AFS, Solaris MC Proxy
Single Control Point: management from a single GUI
Single Virtual Networking
Single Memory Space - Network RAM / DSM
Single Job Management: GLUnix, Codine, LSF
Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may even use web technology
101
Availability Support Functions
Single I/O Space (SIO):
– any node can access any peripheral or disk device without knowing its physical location
Single Process Space (SPS):
– any process on any node can create processes with cluster-wide process IDs, and they communicate through signals, pipes, etc., as if they were on a single node
Checkpointing and Process Migration:
– saving process state and intermediate results in memory or to disk supports rollback recovery when a node fails; process migration supports load balancing
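Checkpointing for rollback recovery can be sketched as follows; the atomic write-then-rename and the every-1000-iterations interval are illustrative choices, not from any particular cluster system:

```python
import os
import pickle
import tempfile

def checkpoint(state, path):
    """Write state to disk atomically (write to a temp file, then rename)."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX: readers see old or new, never half

def restore(path):
    """Reload the last consistent state after a (simulated) node failure."""
    with open(path, "rb") as f:
        return pickle.load(f)

# A long-running accumulation that checkpoints every 1000 iterations.
path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
state = {"i": 0, "total": 0}
while state["i"] < 5000:
    state["total"] += state["i"]
    state["i"] += 1
    if state["i"] % 1000 == 0:
        checkpoint(state, path)

# Another node resumes from the last checkpoint for rollback recovery.
assert restore(path) == {"i": 5000, "total": sum(range(5000))}
```

On a shared file system, any surviving node can pick up the checkpoint; the same saved state is what process migration ships to a new node.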
102
Scalability Vs. Single System Image
[Figure: scalability versus single system image, from a uniprocessor (UP) upward.]
103
SSI Levels/How do we implement SSI ?
It is a computer science notion of levels of abstractions (house is at a higher level of abstraction than walls, ceilings, and floors).
Application and Subsystem Level
Operating System Kernel Level
Hardware Level
104
SSI at Application and Subsystem Level
Level       | Examples                                       | Boundary                                              | Importance
application | cluster batch system, system management        | an application                                        | what a user wants
subsystem   | distributed DB, OSF DME, Lotus Notes, MPI, PVM | a subsystem                                           | SSI for all applications of the subsystem
file system | Sun NFS, OSF DFS, NetWare, and so on           | shared portion of the file system                     | implicitly supports many applications and subsystems
toolkit     | OSF DCE, Sun ONC+, Apollo Domain               | explicit toolkit facilities: user, service name, time | best level of support for heterogeneous system

(c) In search of clusters
105
SSI at Operating System Kernel Level
Level Examples Boundary Importance
Kernel/OS Layer
Solaris MC, Unixware MOSIX, Sprite,Amoeba/ GLunix
kernelinterfaces
virtualmemory
UNIX (Sun) vnode,Locus (IBM) vproc
each name space:files, processes, pipes, devices, etc.
kernel support forapplications, admsubsystems
none supportingoperating system kernel
type of kernelobjects: files,processes, etc.
modularizes SSIcode within kernel
may simplifyimplementationof kernel objects
each distributedvirtual memoryspace
microkernel Mach, PARAS, Chorus,OSF/1AD, Amoeba
implicit SSI forall system services
each serviceoutside themicrokernel
(c) In search of clusters
106
SSI at Hardware Level

Level          | Examples            | Boundary                    | Importance
memory         | SCI, DASH           | memory space                | better communication and synchronization
memory and I/O | SCI, SMP techniques | memory and I/O device space | lower overhead cluster I/O

(c) In search of clusters
107
SSI Characteristics
1. Every SSI has a boundary.
2. Single system support can exist at different levels within a system, one able to be built on another.
108
SSI Boundaries - an application's SSI boundary

[Figure: a batch system enclosed within its SSI boundary.]
(c) In search of clusters
109
Relationship Among Middleware Modules
110
SSI via OS path!
1. Build as a layer on top of the existing OS
– Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time.
– i.e. new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath. E.g. GLUnix.
2. Build SSI at kernel level: a true cluster OS
– Good, but can't leverage OS improvements from the vendor.
– E.g. Unixware, Solaris MC, and MOSIX.
111
SSI Representative Systems
OS-level SSI
– SCO NSC UnixWare
– Solaris MC
– MOSIX, …
Middleware-level SSI
– PVM, TreadMarks (DSM), GLUnix, Condor, Codine, Nimrod, …
Application-level SSI
– PARMON, Parallel Oracle, …
112
SCO NonStop® Cluster for UnixWare
[Figure: two UP or SMP nodes, each running standard SCO UnixWare with clustering hooks - users, applications, and systems management sitting on standard OS kernel calls plus modular kernel extensions - with local devices, connected to each other and to other nodes over ServerNet™.]
http://www.sco.com/products/clustering/
113
How does NonStop Clusters Work?
Modular Extensions and Hooks to Provide:
– Single Clusterwide Filesystem view
– Transparent Clusterwide device access
– Transparent swap space sharing
– Transparent Clusterwide IPC
– High Performance Internode Communications
– Transparent clusterwide processes, migration, etc.
– Node down cleanup and resource failover
– Transparent Clusterwide parallel TCP/IP networking
– Application Availability
– Clusterwide Membership and Cluster timesync
– Cluster System Administration
– Load Leveling
114
Solaris-MC: Solaris for MultiComputers
A global file system, globalized process management, and globalized networking and I/O.

Solaris MC Architecture:
[Figure: applications call the standard system call interface; the Solaris MC layer (file system, processes, network, C++ object framework, with object invocations to other nodes) sits on top of the existing Solaris 2.5 kernel.]
http://www.sun.com/research/solaris-mc/
115
Solaris MC components
Object and communication support
High availability support
PXFS global distributed file system
Process management
Networking
116
Multicomputer OS for UNIX (MOSIX)
An OS module (layer) that provides the applications with the illusion of working on a single system
Remote operations are performed like local operations
Transparent to the application - user interface unchanged
[Figure: stack - Application; PVM / MPI / RSH; MOSIX; Hardware/OS.]
http://www.mosix.cs.huji.ac.il/
117
Main tool

Preemptive process migration that can migrate any process, anywhere, anytime

Supervised by distributed algorithms that respond on-line to global resource availability - transparently

– Load-balancing: migrate processes from over-loaded to under-loaded nodes
– Memory ushering: migrate processes from a node that has exhausted its memory, to prevent paging/swapping
118
MOSIX for Linux at HUJI
A scalable cluster configuration:
– 50 Pentium-II 300 MHz
– 38 Pentium-Pro 200 MHz (some are SMPs)
– 16 Pentium-II 400 MHz (some are SMPs)
Over 12 GB cluster-wide RAM
Connected by the Myrinet 2.56 Gb/s LAN
Runs Red Hat 6.0, based on kernel 2.2.7
Upgrade: HW with Intel, SW with Linux
Download MOSIX:
– http://www.mosix.cs.huji.ac.il/
119
NOW @ Berkeley
Design & implementation of a higher-level system:
– Global OS (GLUnix)
– Parallel file systems (xFS)
– Fast communication (HW for Active Messages)
– Application support
Overcoming technology shortcomings:
– Fault tolerance
– System management
NOW goal: faster for parallel AND sequential applications
http://now.cs.berkeley.edu/
120
NOW Software Components
[Figure: NOW software stack. Large sequential apps and parallel apps run over Sockets, Split-C, MPI, HPF, and vSM; the Global Layer Unix (GLUnix) provides a name server and scheduler; each Unix (Solaris) workstation runs an Active Messages layer (AM L.C.P.) and a VN segment driver over the Myrinet scalable interconnect.]
121
3 Paths for Applications on NOW?

Revolutionary (MPP style): write new programs from scratch using MPP languages, compilers, libraries, …
Porting: port programs from mainframes, supercomputers, MPPs, …
Evolutionary: take a sequential program & use:
1) Network RAM: first use the memory of many computers to reduce disk accesses; if not fast enough, then:
2) Parallel I/O: use many disks in parallel for accesses not in the file cache; if not fast enough, then:
3) Parallel program: change the program until it sees enough processors that it is fast
=> Large speedup without a fine-grain parallel program
122
Comparison of 4 Cluster Systems
123
Cluster Programming Environments
Shared Memory Based
– DSM
– Threads/OpenMP (enabled for clusters)
– Java threads (HKU JESSICA, IBM cJVM)
Message Passing Based
– PVM (PVM)
– MPI (MPI)
Parametric Computations
– Nimrod/Clustor
Automatic Parallelising Compilers
Parallel Libraries & Computational Kernels (NetSolve)
124
Levels of Parallelism

[Table: code granularity, code item, and how each level is exploited]
– Large grain (task level): Program - PVM/MPI
– Medium grain (control level): Function (thread) - Threads
– Fine grain (data level): Loop - Compilers
– Very fine grain (multiple issue): With hardware - CPU

[Figure: tasks i-1, i, i+1 at the task level; func1(), func2(), func3() at the function level; loop iterations a(0)=.., b(0)=.., a(1)=.., b(1)=.., a(2)=.., b(2)=.. at the loop level; individual +, x, and load operations at the instruction level.]
125
MPI (Message Passing Interface)
A standard message passing interface.
– MPI 1.0 - May 1994 (started in 1992)
– C and Fortran bindings (now Java too)
Portable (once coded, it can run on virtually all HPC platforms, including clusters!)
Performance (by exploiting native hardware features)
Functionality (over 115 functions in MPI 1.0)
– environment management, point-to-point & collective communications, process groups, communicators, derived data types, and virtual topology routines.
Availability - a variety of implementations available, both vendor and public domain.
http://www.mpi-forum.org/
126
A Sample MPI Program...
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    int my_rank;        /* process rank */
    int p;              /* no. of processes */
    int source;         /* rank of sender */
    int dest;           /* rank of receiver */
    int tag = 0;        /* message tag, like "email subject" */
    char message[100];  /* buffer */
    MPI_Status status;  /* function return status */

    /* Start up MPI */
    MPI_Init( &argc, &argv );
    /* Find our process rank/id */
    MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
    /* Find out how many processes/tasks are part of this run */
    MPI_Comm_size( MPI_COMM_WORLD, &p );
    …

[Figure: the master (rank 0) receives "Hello, ..." messages from the worker processes.]
127
A Sample MPI Program
    if( my_rank == 0 ) {   /* Master process */
        for( source = 1; source < p; source++ ) {
            MPI_Recv( message, 100, MPI_CHAR, source, tag,
                      MPI_COMM_WORLD, &status );
            printf( "%s\n", message );
        }
    } else {               /* Worker process */
        sprintf( message, "Hello, I am your worker process %d!", my_rank );
        dest = 0;
        MPI_Send( message, strlen(message)+1, MPI_CHAR, dest, tag,
                  MPI_COMM_WORLD );
    }
    /* Shut down MPI environment */
    MPI_Finalize();
    return 0;
}
128
Execution
% cc -o hello hello.c -lmpi
% mpirun -p2 hello
Hello, I am your worker process 1!
% mpirun -p4 hello
Hello, I am your worker process 1!
Hello, I am your worker process 2!
Hello, I am your worker process 3!
% mpirun hello
(no output: with a single process there are no workers, hence no greetings)
129
PARMON: A Cluster Monitoring Tool
[Figure: PARMON clients (running on a JVM) talk over a high-speed switch to the parmond PARMON server running on each node.]
http://www.buyya.com/parmon/
130
Resource Utilization at a Glance
131
Single I/O Space and Design Issues

Globalised Cluster Storage

Reference: "Designing SSI Clusters with Hierarchical Checkpointing and Single I/O Space", IEEE Concurrency, March 1999, by K. Hwang, H. Jin et al.
132
Clusters with & without Single I/O Space

[Figure: without a single I/O space, users must address each node's disks individually; with single I/O space services, users see one storage image across the whole cluster.]
133
Benefits of Single I/O Space

– Eliminates the gap between accessing local disk(s) and remote disks
– Supports a persistent programming paradigm
– Allows striping on remote disks, accelerating parallel I/O operations
– Facilitates the implementation of distributed checkpointing and recovery schemes
134
Single I/O Space Design Issues

– Integrated I/O space
– Addressing and mapping mechanisms
– Data movement procedures
135
Integrated I/O Space

[Figure: sequential addresses run across local disks LD1 … LDn (the RADD space, blocks D11 … Dnt), then shared RAIDs SD1 … SDm (the NASD space, blocks B11 … Bmk), then peripherals P1 … Ph (the NAP space).]
136
Addressing and Mapping

[Figure: user applications call user-level middleware (plus some modified OS system calls); a name agent and a Disk/RAID/NAP mapper resolve addresses, and a block mover shifts data among the I/O agents of the RADD, NASD, and NAP spaces.]
137
Data Movement Procedures

[Figure: two cases of moving data block A between nodes. A user application's I/O agent on Node 1 requests data block A; the block mover transfers A between LD1 on Node 1 and LD2 (or SDi of the NASD) reached through Node 2's I/O agent.]
138
What Next ??
Clusters of Clusters (HyperClusters)
Global Grid
Interplanetary Grid
Universal Grid??
139
Clusters of Clusters (HyperClusters)
[Figure: three clusters, each with a scheduler, a master daemon, execution daemons, and submit/graphical-control clients, interconnected over a LAN/WAN to form a hypercluster.]
140
Towards Grid Computing….
For illustration, resources are placed arbitrarily on the GUSTO test-bed!
141
What is Grid ?
An infrastructure that couples
– Computers (PCs, workstations, clusters, traditional supercomputers, and even laptops, notebooks, mobile computers, PDAs, and so on)
– Software (e.g., renting expensive special-purpose applications on demand)
– Databases (e.g., transparent access to the human genome database)
– Special instruments (e.g., radio telescopes: SETI@Home searching for life in the galaxy, Astrophysics@Swinburne for pulsars)
– People (maybe even animals, who knows?)
across local/wide-area networks (enterprise, organisations, or the Internet) and presents them as a unified integrated (single) resource.
142
Conceptual view of the Grid
Leading to Portal (Super)Computing
http://www.sun.com/hpc/
143
Grid Application-Drivers
Old and new applications getting enabled due to coupling of computers, databases, instruments, people, etc.:
– (Distributed) supercomputing
– Collaborative engineering
– High-throughput computing
• large-scale simulation & parameter studies
– Remote software access / renting software
– Data-intensive computing
– On-demand computing
144
Grid Components
[Figure: Grid layers, bottom to top]
– Grid Fabric: networked resources across organisations (computers, clusters, data sources, scientific instruments, storage systems) with local resource managers (operating systems, queuing systems, TCP/IP & UDP, libraries & app kernels, …)
– Grid Middleware: distributed resource coupling services (communication, sign-on & security, information, process, data access, QoS, …)
– Grid Tools: development environments and tools (languages, libraries, debuggers, web tools, resource brokers, monitoring, …)
– Grid Apps.: applications and portals (problem solving environments, scientific & engineering collaboration, web-enabled apps, …)
145
Many GRID Projects and Initiatives
PUBLIC FORUMS– Computing Portals– Grid Forum– European Grid Forum– IEEE TFCC!– GRID’2000 and more.
Australia– Nimrod/G– EcoGrid and GRACE– DISCWorld
Europe– UNICORE– MOL– METODIS– Globe– Poznan Metacomputing– CERN Data Grid– MetaMPI– DAS– JaWS– and many more...
Public Grid Initiatives– Distributed.net– SETI@Home– Compute Power Grid
USA– Globus– Legion– JAVELIN– AppLes– NASA IPG– Condor– Harness– NetSolve– NCSA Workbench– WebFlow– EveryWhere– and many more...
Japan– Ninf– Bricks– and many more...
http://www.gridcomputing.com/
146
NetSolve: Client/Server/Agent-Based Computing

• Client-Server design
• Network-enabled solvers
• Seamless access to resources
• Non-hierarchical system
• Load balancing
• Fault tolerance
• Interfaces to Fortran, C, Java, Matlab, and more

An easy-to-use tool providing efficient and uniform access to a variety of scientific packages on UNIX platforms.

[Figure: a NetSolve client sends a request to the NetSolve agent, which replies with a choice of network resources from the software repository.]

Software is available at www.cs.utk.edu/netsolve/
147
HARNESS Virtual Machine
Scalable Distributed Control and CCA-Based Daemon

[Figure: hosts A, B, C, and D form a virtual machine; operation within the VM uses distributed control. A component-based HARNESS daemon on each host provides process control and user features, supports customization and extension by dynamically adding plug-ins, and performs discovery and registration with another VM.]
http://www.epm.ornl.gov/harness/
148
HARNESS Core Research
Parallel Plug-ins for Heterogeneous Distributed Virtual Machines

One research goal is to understand and implement a dynamic parallel plug-in environment. This provides a method for many users to extend Harness in much the same way that third-party serial plug-ins extend Netscape, Photoshop, and Linux.

Research issues with parallel plug-ins include heterogeneity, synchronization, interoperation, and partial success. Three typical cases:
• load a plug-in into a single host of the VM without communication
• load a plug-in into a single host, broadcast to the rest of the VM
• load a plug-in into every host of the VM with synchronization
149
Nimrod - A Job Management System
http://www.dgs.monash.edu.au/~davida/nimrod.html
150
Job processing with Nimrod
151
Nimrod/G Architecture
[Figure: Nimrod/G clients talk to the Nimrod engine, which uses a persistent store, a schedule advisor, and a trading manager; a grid explorer queries grid information services; the dispatcher sends jobs through middleware services to the GUSTO test bed, whose nodes each run a local resource manager (RM) and trade server (TS).]
152
Compute Power Market

[Figure: a user's application submits work to a resource broker (grid explorer, schedule advisor, trade manager, job control agent, deployment agent); the broker consults a grid information server and trades with a resource domain's trade server, which performs resource allocation and reservation over resources R1, R2, … Rn, alongside charging algorithms, accounting, and other services.]
153
Pointers to Literature on Cluster Computing
154
Reading Resources..1aInternet & WWW
– Computer Architecture:• http://www.cs.wisc.edu/~arch/www/
– PFS & Parallel I/O• http://www.cs.dartmouth.edu/pario/
– Linux Parallel Processing• http://yara.ecn.purdue.edu/~pplinux/Sites/
– DSMs• http://www.cs.umd.edu/~keleher/dsm.html
155
Reading Resources..1bInternet & WWW
– Solaris-MC• http://www.sunlabs.com/research/solaris-mc
– Microprocessors: Recent Advances• http://www.microprocessor.sscc.ru
– Beowulf:• http://www.beowulf.org
– Metacomputing• http://www.sis.port.ac.uk/~mab/Metacomputing/
156
Reading Resources..2Books
– In Search of Clusters• by G. Pfister, Prentice Hall (2nd ed.), 1998
– High Performance Cluster Computing• Volume1: Architectures and Systems• Volume2: Programming and Applications
– Edited by Rajkumar Buyya, Prentice Hall, NJ, USA.
– Scalable Parallel Computing• by K. Hwang & Z. Xu, McGraw-Hill, 1998
157
Reading Resources..3Journals
– A Case for NOW (Networks of Workstations), IEEE Micro, Feb '95• by Anderson, Culler, Patterson
– Fault Tolerant COW with SSI, IEEE Concurrency, (to appear)
• by Kai Hwang, Chow, Wang, Jin, Xu
– Cluster Computing: The Commodity Supercomputing, Journal of Software Practice and Experience-(get from my web)
• by Mark Baker & Rajkumar Buyya
158
Cluster Computing Infoware
http://www.csse.monash.edu.au/~rajkumar/cluster/
159
Cluster Computing Forum
IEEE Task Force on Cluster Computing
(TFCC)
http://www.ieeetfcc.org
160
TFCC Activities...
Network Technologies
OS Technologies
Parallel I/O
Programming Environments
Java Technologies
Algorithms and Applications
Analysis and Profiling
Storage Technologies
High Throughput Computing
161
TFCC Activities...
High Availability
Single System Image
Performance Evaluation
Software Engineering
Education
Newsletter
Industrial Wing
TFCC Regional Activities
– All the above have their own pages; see pointers from:
– http://www.ieeetfcc.org
162
TFCC Activities...
Mailing list, workshops, conferences, tutorials, web resources, etc.
Resources for introducing the subject at senior undergraduate and graduate levels.
Tutorials/workshops at IEEE Chapters, and so on.
FREE MEMBERSHIP, please join!
Visit the TFCC page for more details:
– http://www.ieeetfcc.org (updated daily!)
163
Clusters Revisited
164
Summary
We have discussed clusters:
– Enabling technologies
– Architecture & its components
– Classifications
– Middleware
– Single system image
– Representative systems
165
Conclusions
Clusters are promising..
– They solve the parallel processing paradox
– They offer incremental growth and match typical funding patterns
– New trends in hardware and software technologies are likely to make clusters even more promising, so that cluster-based supercomputers can be seen everywhere!
166
Computing Platforms Evolution: Breaking Administrative Barriers

[Figure: performance grows as administrative barriers are crossed: from the desktop (single processor?), to SMPs or supercomputers, to a local cluster, to an enterprise cluster/grid, to a global cluster/grid, and perhaps an inter-planet cluster/grid? The barriers span individual, group, department, campus, state, national, globe, inter-planet, universe.]
167
Thank You ...Thank You ...
?
168
Backup Slides...
169
SISD : A Conventional Computer
Speed is limited by the rate at which the computer can transfer information internally.

[Figure: a single processor receives one instruction stream, takes one data input stream, and produces one data output stream.]

Ex: PC, Macintosh, workstations
170
The MISD Architecture
More of an intellectual exercise than a practical configuration; few were built, and none are commercially available.

[Figure: processors A, B, and C each receive their own instruction stream (A, B, C) but share a single data input stream, producing a single data output stream.]
171
SIMD Architecture
[Figure: a single instruction stream drives processors A, B, and C, each with its own data input stream (A, B, C) and data output stream (A, B, C), computing e.g. Ci <= Ai * Bi.]

Ex: CRAY machine vector processing, Thinking Machines CM*
172
MIMD Architecture

Unlike SISD and MISD machines, a MIMD computer works asynchronously.
– Shared memory (tightly coupled) MIMD
– Distributed memory (loosely coupled) MIMD

[Figure: processors A, B, and C each receive their own instruction stream (A, B, C) and operate on their own data input stream (A, B, C), producing their own data output stream (A, B, C).]