Scalable Systems and Technology
Einar Rustad
Scali AS
[email protected]
http://www.scali.com
Slide 2
Definition of Cluster
• The Widest Definition:
  – Any number of computers communicating at any distance
• The Common Definition:
  – A relatively small number of computers (<1000) communicating at a relatively small distance (within the same room) and used as a single, shared computing resource
Slide 3
Increasing Performance
• Faster Processors
  – Frequency
  – Instruction Level Parallelism (ILP)
• Better Algorithms
  – Compilers
  – Manpower
• Parallel Processing
  – Compilers
  – Tools (Profilers, Debuggers)
  – More Manpower
Slide 4
Use of Clusters
• Capacity Servers
  – Databases
  – Client/Server Computing
• Throughput Servers
  – Numerical Applications
  – Simulation and Modelling
• High Availability Servers
  – Transaction Processing
Slide 5
Why Clustering
• Scaling of Resources
• Sharing of Resources
• Best Price/Performance Ratio (PPR)
  – PPR is Constant with Growing System Size
• Flexibility
• High Availability
• Fault Resilience
Slide 6
Clusters vs SMPs (1)
• Programming
  – A Program written for Cluster Parallelism can run on an SMP right away (see the sketch below)
  – A Program written for an SMP can NOT run on a Cluster right away
• Scalability
  – Clusters are Scalable
  – SMPs are NOT Scalable above a Small Number of Processors
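To make the portability point concrete, here is a minimal message-passing program (a sketch using only the standard MPI-1 API; not taken from the original slides). The same binary runs unchanged whether its ranks are processes on one SMP or are spread across cluster nodes; only the launch configuration differs:

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 passes a token to rank 1; run with at least 2 processes. */
    int main(int argc, char **argv)
    {
        int rank, token = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received token %d\n", token);
        }
        MPI_Finalize();
        return 0;
    }

The reverse direction fails because an SMP program typically assumes one shared address space (threads, shared variables), which a cluster does not provide.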
Slide 7
Why SMPs don't scale
[Diagram: a four-CPU bus-based SMP sharing one Memory and I/O ("This is an SMP"), contrasted with CPU-plus-Memory nodes coupled through L3C links and an Interconnect ("This is NOT an SMP...")]
When CPUs cycle at 1 GHz and memory latency is >100 ns, a 1% cache miss rate implies <50% CPU efficiency.
But you can make all the memory equally slow... (crossbar complexity grows with the number of ports squared)
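The efficiency figure follows from a simple average-instruction-time argument (a sketch, assuming an ideal one instruction per 1 ns cycle and a flat 100 ns miss penalty):

\[
t_{\mathrm{avg}} = 0.99 \times 1\,\mathrm{ns} + 0.01 \times 100\,\mathrm{ns} \approx 2\,\mathrm{ns},
\qquad
\mathrm{efficiency} = \frac{1\,\mathrm{ns}}{t_{\mathrm{avg}}} \approx 50\%
\]

With memory latency above 100 ns, efficiency drops below 50%.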
Slide 8
Clusters vs SMPs (2)
Use of SMPs:
• Common Access to Shared Resources
  – Processors
  – Memory
  – Storage Devices
• Running Multiple Applications
• Running Multiple Instances of the Same Application
• Running Parallel Applications
Use of Clusters:
• Common Access to Shared Resources
  – Processors
  – Distributed Memory
  – Storage Devices
• Running Multiple Applications
• Running Multiple Instances of the Same Application
• Running Parallel Applications
Slide 9
Single System Image
• One big advantage of SMPs is the Single System Image
  – Easier Administration and Support
  – But: a Single Point of Failure
• Scali's "Universe" offers a Single System Image to Administrators and Users
  – As Easy to Use and Support as an SMP
  – No Single Point of Failure (N copies of the same OS)
  – Redundancy in the "Universe" Architecture
Slide 10
Clustering makes Mo(o)re Sense
• Microprocessor Performance Increases 50-60% per Year
  – 1-year lag: 1.0 WS = 1.6 Proprietary Units
  – 2-year lag: 1.0 WS = 2.6 Proprietary Units
• Volume Disadvantage
  – When Volume Doubles, Cost is Reduced to 90%
  – 1,000 Proprietary Units vs 1,000,000 SHV Units => Proprietary Unit ~3x More Expensive
• 2-year Lag and 1:1000 Volume Disadvantage => 7x Worse Price/Performance
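The 7x figure is the product of the two effects (a back-of-the-envelope check using the slide's own numbers):

\[
1.6^{2} \approx 2.6 \quad \text{(2-year performance lag)},
\qquad
\bigl(0.9^{\log_2 1000}\bigr)^{-1} \approx \bigl(0.9^{10}\bigr)^{-1} \approx 2.9 \quad \text{(volume cost penalty)}
\]
\[
2.6 \times 2.9 \approx 7.4 \approx 7\times \text{ worse price/performance}
\]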
Slide 11
Why Do We Need SMPs?
• Small SMPs make Great Nodes for building Clusters!
• The most Cost-Effective Cluster Node is a Dual-Processor SMP
Slide 12
Mission
Scali is dedicated to making
State-of-the-art Middleware
and
System Management Software,
the key enabling SW technologies
for building
Scalable Systems
Slide 13
Application Areas
[Diagram: Basic Technologies (Interconnect, PC Technology, Linux OS) plus Scali Software yield Scalable Systems, serving ISPs, ASPs, Departmental Servers, and E-commerce/Databases]
Slide 14
Platform Attraction
[Diagram: Scali platform components (GUI, System Monitoring, Configuration Management, ICM, DQS, ScaMPI) surrounded by third-party tools: TotalView, TimeScan, Vampir, PGI]
Slide 15
Technology
• High-Performance implementation of MPI
• ICM - InterConnect Manager for SCI
• Parallel Systems Configuration Server
• Parallel Systems Monitoring
• Expert knowledge in
  – Computer Architecture
  – Processor and Communication Hardware
  – Software Design and Development
  – Parallelization
  – System Integration and Packaging
[Diagram: software stack with the Application on top of MPI, the Configuration Server, the System Monitor, and the Sys Adm GUI, over ICM, the Operating System, and the Hardware]
Slide 16
Key Factors
• High-Performance Systems Need
  – High Processor Speed
  – High-Bandwidth Interconnect
  – Low-Latency Communication
• Balanced Resources
• Economy-of-Scale Components
• Establishes a new Standard for Price/Performance
Slide 17
Software Design Strategy
• Client-Server Architecture
• Implemented as
  – Application-Level Modules
  – Libraries
  – Daemons
  – Scripts
• No OS Modifications
Slide 18
Advantages
• Industry-Standard Programming Model - MPI
  – MPICH Compatible
• Lower Cost
  – COTS-based Hardware = Lower System Price
  – Lower Total Cost of Ownership
• Better Performance
  – Always "Latest & Greatest" Processors
  – Superior Standard Interconnect - SCI
• Scalability
  – Scalable to Hundreds of Processors
• Redundancy
• Single System Image to Users and Administrators
• Choice of OS
  – Linux
  – Solaris
  – Windows NT
Slide 19
Scali MPI - Unique Features
• Fault Tolerant
• High Bandwidth
• Low Latency
• Multi-Thread Safe
• Simultaneous Inter-/Intra-Node Operation
• UNIX Command Line Replicated
• Exact Message Size Option
• Manual/Debugger Mode for Selected Processes
• Explicit Host Specification
• Job Queuing
  – PBS, DQS, LSF, CCS, NQS, Maui
• Conformance to MPI-1.1 Verified through 1665 MPI Tests
Slide 20
Parallel Processing Constraints
[Chart: speedup (y-axis 1-17) vs number of processors (1-16) for 0%, 2%, 5%, and 10% overlap]
[Diagram: timelines of processes P1-P4 through Initialization, Processing, and Storing Results, showing Communication and Computation phases and overlaps in processing]
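In practice the communication constraint is attacked by overlapping communication with computation using nonblocking MPI calls. A minimal sketch (illustrative only; compute_interior and compute_boundary are hypothetical application hooks, not from the slides):

    #include <mpi.h>

    void compute_interior(void);          /* hypothetical: work needing no remote data */
    void compute_boundary(double *halo);  /* hypothetical: work needing the received halo */

    /* One step of a halo exchange overlapped with interior computation. */
    void step(double *halo_send, double *halo_recv, int n,
              int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];
        MPI_Status  st[2];

        MPI_Isend(halo_send, n, MPI_DOUBLE, right, 0, comm, &req[0]);
        MPI_Irecv(halo_recv, n, MPI_DOUBLE, left,  0, comm, &req[1]);

        compute_interior();               /* overlaps with the transfers above */

        MPI_Waitall(2, req, st);
        compute_boundary(halo_recv);      /* needs the received data, so it waits */
    }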
Slide 21
System Interconnect
Main Interconnect:
• Torus Topology
• SCI - IEEE/ANSI Std. 1596
• 667 MB/s per segment per ring
• Shared Address Space
Maintenance and LAN Interconnect:
• 100 Mbit/s Ethernet
Slide 22
2-D Torus Topology
Distributed Switching:
[Diagram: node adapter with a PSB on the PCI bus and two LC3 link controllers on the B-Link, one serving the horizontal SCI ring and one the vertical SCI ring]
Slide 23
Scalability with 33 MHz/32-bit PCI
[Chart, log-log: bandwidth (0.1-1000) vs number of nodes (1-10000) for Ringlet, 2D-Torus, 3D-Torus, and 4D-Torus against the PCI limit; reference points at 12, 144, and 1728 nodes]
Slide 24
Scalability with 66 MHz/64-bit PCI
[Chart, log-log: bandwidth (0.1-1000) vs number of nodes (1-10000) for Ringlet, 2D-Torus, 3D-Torus, and 4D-Torus against the PCI limit; reference points at 12, 144, and 1728 nodes]
Slide 25
Paderborn
PSC2: 12 x 8 Torus, 192 Processors at 450 MHz, 86.4 GFlops
PSC1: 8 x 4 Torus, 64 Processors at 300 MHz, 19.2 GFlops
Slide 26
MPI_Alltoall()
[Chart: ScaMPI sustained accumulated MPI_Alltoall() bandwidth (MByte/s, 0-2500) vs number of nodes (2-96)]
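Accumulated all-to-all bandwidth is typically measured along these lines (a minimal sketch; the message size, iteration count, and the exact "accumulated" definition are illustrative assumptions, not Scali's benchmark code):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG   65536    /* bytes sent to each peer (illustrative) */
    #define ITERS 100

    int main(int argc, char **argv)
    {
        int rank, size, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char *sbuf = calloc((size_t)size * MSG, 1);
        char *rbuf = calloc((size_t)size * MSG, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++)
            MPI_Alltoall(sbuf, MSG, MPI_BYTE, rbuf, MSG, MPI_BYTE,
                         MPI_COMM_WORLD);
        double t = MPI_Wtime() - t0;

        /* Count bytes crossing the interconnect: each of 'size' ranks
           sends MSG bytes to the other size-1 ranks per iteration. */
        if (rank == 0)
            printf("%.1f MByte/s accumulated\n",
                   (double)ITERS * size * (size - 1) * MSG / t / 1e6);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }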
Slide 27
MPI_Barrier()
[Chart: ScaMPI MPI_Barrier() latency, arithmetic average (0-40) vs number of nodes (2-96)]
Slide 28
Versus Myrinet (1)
[Chart: 2-node ping-pong performance (0-90) vs message size (0 B to 3 MB) for Myrinet GM/MPICH and Dolphin SCI/ScaMPI]
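Ping-pong curves like these come from the classic two-rank round-trip benchmark (a minimal sketch; the message size and repetition count are illustrative, and this is not the vendors' benchmark code):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS 1000

    int main(int argc, char **argv)
    {
        int rank, i, n = 65536;   /* message size in bytes (illustrative) */
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = calloc(n, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = MPI_Wtime() - t0;

        /* 2*REPS one-way transfers of n bytes in time t */
        if (rank == 0)
            printf("%.1f MByte/s, %.1f us per one-way transfer\n",
                   2.0 * REPS * n / t / 1e6, t / (2.0 * REPS) * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }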
Slide 29
Versus Myrinet (2)
[Chart: 2-node two-way performance (0-90) vs message size (0 B to 3 MB) for Myrinet GM/MPICH and Dolphin SCI/ScaMPI]
Slide 30
Versus Myrinet (3)
[Chart: barrier synchronization (0-200) vs number of nodes (2, 4, 8, 9, 16) for MPICH/Myrinet GM and Scali MPI/SCI]
Slide 31
Versus Myrinet (4)
[Chart: all-to-all performance (0-90) vs number of nodes (2, 4, 8, 9, 16) for MPICH/Myrinet GM and Scali MPI/SCI]
Slide 32
Versus Origin 2000 (1)
[Chart: all-to-all bandwidth per node (0-120) vs number of nodes (2-64) for Origin2k and ScaMPI/SCI]
Slide 33
Versus Origin 2000 (2)
[Chart: barrier synchronization (0-500) vs number of nodes (2-64) for Origin2k and ScaMPI/SCI]
Slide 34
System Architecture
[Diagram: three 4x4 2D-torus SCI clusters behind a control node (front-end); a server daemon on the front-end connects over TCP/IP sockets to the node daemons, and GUIs run on the control node and a remote workstation]
Slide 35
Fault Tolerance
• 2D Torus Topology
  – More routing options
• XY Routing Algorithm
  – Node 33 fails
  – Nodes on 33's ringlets become unavailable
  – Cluster fractured with current routing setting
[Diagram: 4x4 torus, nodes 11-44; node 33 has failed]
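XY (dimension-order) routing resolves the X coordinate first, then Y, which is why one dead node blocks both of its rings. A next-hop sketch on an m x n torus (purely illustrative; real SCI routing is table-driven in the adapters, and this is not Scali's code):

    /* Dimension-order (XY) routing: correct X first, taking the
       shortest way around the ring, then correct Y the same way. */
    typedef struct { int x, y; } Node;

    Node xy_next_hop(Node cur, Node dst, int m, int n)
    {
        if (cur.x != dst.x) {
            int fwd = (dst.x - cur.x + m) % m;   /* hops in +X direction */
            cur.x = (fwd <= m - fwd) ? (cur.x + 1) % m
                                     : (cur.x + m - 1) % m;
        } else if (cur.y != dst.y) {
            int fwd = (dst.y - cur.y + n) % n;   /* hops in +Y direction */
            cur.y = (fwd <= n - fwd) ? (cur.y + 1) % n
                                     : (cur.y + n - 1) % n;
        }
        return cur;  /* unchanged when cur == dst */
    }

Every path that must turn at the failed node is lost, so without rerouting the nodes sharing its horizontal and vertical rings become unreachable, exactly as the slide shows.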
Slide 36
Fault Tolerance
• Rerouting with XY
  – Failed node logically remapped to a corner
  – End-point IDs unchanged
  – Applications can continue
• Problem:
  – Too many working nodes left unused
[Diagram: the 4x4 grid logically shifted so failed node 33 sits in a corner]
Slide 37
Fault Tolerance
• Scali advanced routing algorithm:
  – From the Turn Model family of routing algorithms
• All nodes but the failed one can be utilised as one big partition
[Diagram: same 4x4 torus; the 15 working nodes form a single partition]
Slide 38
The Scali Universe
Slide 39
System Management
Slide 40
Software Configuration Management
Nodes are categorised once; from then on, new software is installed by one mouse click or with a single command.
Slide 41
System Monitoring
Slide 42
Products (1)
• Platforms
  – Intel IA-32/Linux
  – Intel IA-32/Solaris
  – Alpha/Linux
  – SPARC/Solaris
  – IA-64/Linux
• Middleware
  – MPI 1.1
  – MPI 2
  – IP
  – SAN
  – VIA
  – Cray shmem
Slide 43
Products (2)
• "TeraRack" Pentium
  – Each Rack:
    • 36 x 1U Units (each a Dual PIII 800 MHz)
    • 57.6 GFlops
    • 144 GBytes SDRAM
    • 8.1 TBytes Disk
    • Power Switches
    • Console Routers
    • 2-D Torus SCI