Scalable Systems and Technology
Einar Rustad
Scali AS
[email protected]
http://www.scali.com
Slide 2
Definition of Cluster
• The Widest Definition:
  – Any number of computers communicating at any distance
• The Common Definition:
  – A relatively small number of computers (<1000) communicating at a relatively small distance (within the same room) and used as a single, shared computing resource
Slide 3
Increasing Performance
• Faster Processors
  – Frequency
  – Instruction Level Parallelism (ILP)
• Better Algorithms
  – Compilers
  – Manpower
• Parallel Processing
  – Compilers
  – Tools (Profilers, Debuggers)
  – More Manpower
Slide 4
Use of Clusters
• Capacity Servers
  – Databases
  – Client/Server Computing
• Throughput Servers
  – Numerical Applications
  – Simulation and Modelling
• High Availability Servers
  – Transaction Processing
Slide 5
Why Clustering
• Scaling of Resources
• Sharing of Resources
• Best Price/Performance Ratio (PPR)
  – PPR is Constant with Growing System Size
• Flexibility
• High Availability
• Fault Resilience
Slide 6
Clusters vs SMPs (1)
• Programming
  – A Program written for Cluster Parallelism can run on an SMP right away (see the sketch below)
  – A Program written for an SMP can NOT run on a Cluster right away
• Scalability
  – Clusters are Scalable
  – SMPs are NOT Scalable above a Small Number of Processors
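To make the portability point concrete, here is a minimal message-passing program (a sketch using only the standard MPI-1 API; not taken from the original slides). The same binary runs unchanged whether its ranks are processes on one SMP or are spread across cluster nodes; only the launch configuration differs:

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 passes a token to rank 1; run with at least 2 processes. */
    int main(int argc, char **argv)
    {
        int rank, token = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received token %d\n", token);
        }
        MPI_Finalize();
        return 0;
    }

The reverse direction fails because an SMP program typically assumes one shared address space (threads, shared variables), which a cluster does not provide.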
Slide 7
Why SMPs don't scale
[Diagram: a four-CPU bus-based SMP sharing one Memory and I/O ("This is an SMP"), contrasted with CPU-plus-Memory nodes coupled through L3C links and an Interconnect ("This is NOT an SMP...")]
When CPUs cycle at 1 GHz and memory latency is >100 ns, a 1% cache miss rate implies <50% CPU efficiency.
But you can make all the memory equally slow... (crossbar complexity grows with the number of ports squared)
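The efficiency figure follows from a simple average-instruction-time argument (a sketch, assuming an ideal one instruction per 1 ns cycle and a flat 100 ns miss penalty):

\[
t_{\mathrm{avg}} = 0.99 \times 1\,\mathrm{ns} + 0.01 \times 100\,\mathrm{ns} \approx 2\,\mathrm{ns},
\qquad
\mathrm{efficiency} = \frac{1\,\mathrm{ns}}{t_{\mathrm{avg}}} \approx 50\%
\]

With memory latency above 100 ns, efficiency drops below 50%.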
Slide 8
Clusters vs SMPs (2)
Use of SMPs:
• Common Access to Shared Resources
  – Processors
  – Memory
  – Storage Devices
• Running Multiple Applications
• Running Multiple Instances of the Same Application
• Running Parallel Applications
Use of Clusters:
• Common Access to Shared Resources
  – Processors
  – Distributed Memory
  – Storage Devices
• Running Multiple Applications
• Running Multiple Instances of the Same Application
• Running Parallel Applications
Slide 9
Single System Image
• One big advantage of SMPs is the Single System Image
  – Easier Administration and Support
  – But: a Single Point of Failure
• Scali's "Universe" offers a Single System Image to Administrators and Users
  – As Easy to Use and Support as an SMP
  – No Single Point of Failure (N copies of the same OS)
  – Redundancy in the "Universe" Architecture
Slide 10
Clustering makes Mo(o)re Sense
• Microprocessor Performance Increases 50-60% per Year
  – 1-year lag: 1.0 WS = 1.6 Proprietary Units
  – 2-year lag: 1.0 WS = 2.6 Proprietary Units
• Volume Disadvantage
  – When Volume Doubles, Cost is Reduced to 90%
  – 1,000 Proprietary Units vs 1,000,000 SHV Units => Proprietary Unit ~3x More Expensive
• 2-year Lag and 1:1000 Volume Disadvantage => 7x Worse Price/Performance
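The 7x figure is the product of the two effects (a back-of-the-envelope check using the slide's own numbers):

\[
1.6^{2} \approx 2.6 \quad \text{(2-year performance lag)},
\qquad
\bigl(0.9^{\log_2 1000}\bigr)^{-1} \approx \bigl(0.9^{10}\bigr)^{-1} \approx 2.9 \quad \text{(volume cost penalty)}
\]
\[
2.6 \times 2.9 \approx 7.4 \approx 7\times \text{ worse price/performance}
\]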
Slide 11
Why Do We Need SMPs?
• Small SMPs make Great Nodes for building Clusters!
• The most Cost-Effective Cluster Node is a Dual-Processor SMP
Slide 12
Mission
Scali is dedicated to making
State-of-the-art Middleware
and
System Management Software,
the key enabling SW technologies
for building
Scalable Systems
Slide 13
Application Areas
[Diagram: Basic Technologies (Interconnect, PC Technology, Linux OS) plus Scali Software yield Scalable Systems, serving ISPs, ASPs, Departmental Servers, and E-commerce/Databases]
Slide 14
Platform Attraction
[Diagram: Scali platform components (GUI, System Monitoring, Configuration Management, ICM, DQS, ScaMPI) surrounded by third-party tools: TotalView, TimeScan, Vampir, PGI]
Slide 15
Technology
• High-Performance implementation of MPI
• ICM - InterConnect Manager for SCI
• Parallel Systems Configuration Server
• Parallel Systems Monitoring
• Expert knowledge in
  – Computer Architecture
  – Processor and Communication Hardware
  – Software Design and Development
  – Parallelization
  – System Integration and Packaging
[Diagram: software stack with the Application on top of MPI, the Configuration Server, the System Monitor, and the Sys Adm GUI, over ICM, the Operating System, and the Hardware]
Slide 16
Key Factors
• High-Performance Systems Need
  – High Processor Speed
  – High-Bandwidth Interconnect
  – Low-Latency Communication
• Balanced Resources
• Economy-of-Scale Components
• Establishes a new Standard for Price/Performance
Slide 17
Software Design Strategy
• Client-Server Architecture
• Implemented as
  – Application-Level Modules
  – Libraries
  – Daemons
  – Scripts
• No OS Modifications
Slide 18
Advantages
• Industry-Standard Programming Model - MPI
  – MPICH Compatible
• Lower Cost
  – COTS-based Hardware = Lower System Price
  – Lower Total Cost of Ownership
• Better Performance
  – Always "Latest & Greatest" Processors
  – Superior Standard Interconnect - SCI
• Scalability
  – Scalable to Hundreds of Processors
• Redundancy
• Single System Image to Users and Administrators
• Choice of OS
  – Linux
  – Solaris
  – Windows NT
Slide 19
Scali MPI - Unique Features
• Fault Tolerant
• High Bandwidth
• Low Latency
• Multi-Thread Safe
• Simultaneous Inter-/Intra-Node Operation
• UNIX Command Line Replicated
• Exact Message Size Option
• Manual/Debugger Mode for Selected Processes
• Explicit Host Specification
• Job Queuing
  – PBS, DQS, LSF, CCS, NQS, Maui
• Conformance to MPI-1.1 Verified through 1665 MPI Tests
Slide 20
Parallel Processing Constraints
[Chart: speedup (y-axis 1-17) vs number of processors (1-16) for 0%, 2%, 5%, and 10% overlap]
[Diagram: timelines of processes P1-P4 through Initialization, Processing, and Storing Results, showing Communication and Computation phases and overlaps in processing]
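In practice the communication constraint is attacked by overlapping communication with computation using nonblocking MPI calls. A minimal sketch (illustrative only; compute_interior and compute_boundary are hypothetical application hooks, not from the slides):

    #include <mpi.h>

    void compute_interior(void);          /* hypothetical: work needing no remote data */
    void compute_boundary(double *halo);  /* hypothetical: work needing the received halo */

    /* One step of a halo exchange overlapped with interior computation. */
    void step(double *halo_send, double *halo_recv, int n,
              int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];
        MPI_Status  st[2];

        MPI_Isend(halo_send, n, MPI_DOUBLE, right, 0, comm, &req[0]);
        MPI_Irecv(halo_recv, n, MPI_DOUBLE, left,  0, comm, &req[1]);

        compute_interior();               /* overlaps with the transfers above */

        MPI_Waitall(2, req, st);
        compute_boundary(halo_recv);      /* needs the received data, so it waits */
    }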
Slide 21
System Interconnect
Main Interconnect:
• Torus Topology
• SCI - IEEE/ANSI Std. 1596
• 667 MB/s per segment per ring
• Shared Address Space
Maintenance and LAN Interconnect:
• 100 Mbit/s Ethernet
Slide 22
2-D Torus Topology
Distributed Switching:
[Diagram: node adapter with a PSB on the PCI bus and two LC3 link controllers on the B-Link, one serving the horizontal SCI ring and one the vertical SCI ring]
Slide 23
Scalability with 33 MHz/32-bit PCI
[Chart, log-log: bandwidth (0.1-1000) vs number of nodes (1-10000) for Ringlet, 2D-Torus, 3D-Torus, and 4D-Torus against the PCI limit; reference points at 12, 144, and 1728 nodes]
Slide 24
Scalability with 66 MHz/64-bit PCI
[Chart, log-log: bandwidth (0.1-1000) vs number of nodes (1-10000) for Ringlet, 2D-Torus, 3D-Torus, and 4D-Torus against the PCI limit; reference points at 12, 144, and 1728 nodes]
Slide 25
Paderborn
PSC2: 12 x 8 Torus, 192 Processors at 450 MHz, 86.4 GFlops
PSC1: 8 x 4 Torus, 64 Processors at 300 MHz, 19.2 GFlops
Slide 26
MPI_Alltoall()
[Chart: ScaMPI sustained accumulated MPI_Alltoall() bandwidth (MByte/s, 0-2500) vs number of nodes (2-96)]
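Accumulated all-to-all bandwidth is typically measured along these lines (a minimal sketch; the message size, iteration count, and the exact "accumulated" definition are illustrative assumptions, not Scali's benchmark code):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG   65536    /* bytes sent to each peer (illustrative) */
    #define ITERS 100

    int main(int argc, char **argv)
    {
        int rank, size, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char *sbuf = calloc((size_t)size * MSG, 1);
        char *rbuf = calloc((size_t)size * MSG, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++)
            MPI_Alltoall(sbuf, MSG, MPI_BYTE, rbuf, MSG, MPI_BYTE,
                         MPI_COMM_WORLD);
        double t = MPI_Wtime() - t0;

        /* Count bytes crossing the interconnect: each of 'size' ranks
           sends MSG bytes to the other size-1 ranks per iteration. */
        if (rank == 0)
            printf("%.1f MByte/s accumulated\n",
                   (double)ITERS * size * (size - 1) * MSG / t / 1e6);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }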
Slide 27
MPI_Barrier()
[Chart: ScaMPI MPI_Barrier() latency, arithmetic average (0-40) vs number of nodes (2-96)]
Slide 28
Versus Myrinet (1)
[Chart: 2-node ping-pong performance (0-90) vs message size (0 B to 3 MB) for Myrinet GM/MPICH and Dolphin SCI/ScaMPI]
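Ping-pong curves like these come from the classic two-rank round-trip benchmark (a minimal sketch; the message size and repetition count are illustrative, and this is not the vendors' benchmark code):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS 1000

    int main(int argc, char **argv)
    {
        int rank, i, n = 65536;   /* message size in bytes (illustrative) */
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = calloc(n, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = MPI_Wtime() - t0;

        /* 2*REPS one-way transfers of n bytes in time t */
        if (rank == 0)
            printf("%.1f MByte/s, %.1f us per one-way transfer\n",
                   2.0 * REPS * n / t / 1e6, t / (2.0 * REPS) * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }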
Slide 29
Versus Myrinet (2)
[Chart: 2-node two-way performance (0-90) vs message size (0 B to 3 MB) for Myrinet GM/MPICH and Dolphin SCI/ScaMPI]
Slide 30
Versus Myrinet (3)
[Chart: barrier synchronization (0-200) vs number of nodes (2, 4, 8, 9, 16) for MPICH/Myrinet GM and Scali MPI/SCI]
Slide 31
Versus Myrinet (4)
[Chart: all-to-all performance (0-90) vs number of nodes (2, 4, 8, 9, 16) for MPICH/Myrinet GM and Scali MPI/SCI]
Slide 32
Versus Origin 2000 (1)
[Chart: all-to-all bandwidth per node (0-120) vs number of nodes (2-64) for Origin2k and ScaMPI/SCI]
Slide 33
Versus Origin 2000 (2)
[Chart: barrier synchronization (0-500) vs number of nodes (2-64) for Origin2k and ScaMPI/SCI]
Slide 34
System Architecture
[Diagram: three 4x4 2D-torus SCI clusters behind a control node (front-end); a server daemon on the front-end connects over TCP/IP sockets to the node daemons, and GUIs run on the control node and a remote workstation]
Slide 35
Fault Tolerance
• 2D Torus Topology
  – More routing options
• XY Routing Algorithm
  – Node 33 fails
  – Nodes on 33's ringlets become unavailable
  – Cluster fractured with current routing setting
[Diagram: 4x4 torus, nodes 11-44; node 33 has failed]
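XY (dimension-order) routing resolves the X coordinate first, then Y, which is why one dead node blocks both of its rings. A next-hop sketch on an m x n torus (purely illustrative; real SCI routing is table-driven in the adapters, and this is not Scali's code):

    /* Dimension-order (XY) routing: correct X first, taking the
       shortest way around the ring, then correct Y the same way. */
    typedef struct { int x, y; } Node;

    Node xy_next_hop(Node cur, Node dst, int m, int n)
    {
        if (cur.x != dst.x) {
            int fwd = (dst.x - cur.x + m) % m;   /* hops in +X direction */
            cur.x = (fwd <= m - fwd) ? (cur.x + 1) % m
                                     : (cur.x + m - 1) % m;
        } else if (cur.y != dst.y) {
            int fwd = (dst.y - cur.y + n) % n;   /* hops in +Y direction */
            cur.y = (fwd <= n - fwd) ? (cur.y + 1) % n
                                     : (cur.y + n - 1) % n;
        }
        return cur;  /* unchanged when cur == dst */
    }

Every path that must turn at the failed node is lost, so without rerouting the nodes sharing its horizontal and vertical rings become unreachable, exactly as the slide shows.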
Slide 36
Fault Tolerance
• Rerouting with XY
  – Failed node logically remapped to a corner
  – End-point IDs unchanged
  – Applications can continue
• Problem:
  – Too many working nodes left unused
[Diagram: the 4x4 grid logically shifted so failed node 33 sits in a corner]
Slide 37
Fault Tolerance
• Scali advanced routing algorithm:
  – From the Turn Model family of routing algorithms
• All nodes but the failed one can be utilised as one big partition
[Diagram: same 4x4 torus; the 15 working nodes form a single partition]
Slide 38
The Scali Universe
Slide 39
System Management
Slide 40
Software Configuration Management
Nodes are categorised once; from then on, new software is installed by one mouse click or with a single command.
Slide 41
System Monitoring
Slide 42
Products (1)
• Platforms
  – Intel IA-32/Linux
  – Intel IA-32/Solaris
  – Alpha/Linux
  – SPARC/Solaris
  – IA-64/Linux
• Middleware
  – MPI 1.1
  – MPI 2
  – IP
  – SAN
  – VIA
  – Cray shmem
Slide 43
Products (2)
• "TeraRack" Pentium
  – Each Rack:
    • 36 x 1U Units (each a Dual PIII 800 MHz)
    • 57.6 GFlops
    • 144 GBytes SDRAM
    • 8.1 TBytes Disk
    • Power Switches
    • Console Routers
    • 2-D Torus SCI