HELICS
Petteri Johansson & Ilkka Uuhiniemi
HELICS
• COW (cluster of workstations)
– AMD Athlon MP 1.4 GHz
– 512 processors (2 per compute node)
– No. 35 at top500.org
– Linpack benchmark: 825 Gflops
– COTS components -> 1.3M euros
HELICS
• 256 GBytes ECC RAM
• 10 TB local disks
• Myrinet 2000 (fiber)
• 6 switches (128 port)
• Ethernet
• Peak performance 512*2.8GFlops = 1.43TFlops
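The peak figure above can be checked with a few lines of arithmetic. This is only a sketch: the slides give 2.8 GFlops per CPU without explanation, and the value of 2 floating-point operations per clock cycle used below is an assumption chosen to match it. The comparison with the Linpack result uses the 825 Gflops quoted earlier.

```python
# Theoretical peak vs. measured Linpack performance, using the slide figures.
CPUS = 512               # from the slides
CLOCK_GHZ = 1.4          # from the slides
FLOPS_PER_CYCLE = 2      # assumption: makes 1.4 GHz match the quoted 2.8 GFlops
LINPACK_GFLOPS = 825.0   # from the slides

per_cpu_gflops = CLOCK_GHZ * FLOPS_PER_CYCLE
peak_gflops = CPUS * per_cpu_gflops
efficiency = LINPACK_GFLOPS / peak_gflops  # fraction of peak reached by Linpack

print(f"peak per CPU: {per_cpu_gflops:.1f} GFlops")
print(f"cluster peak: {peak_gflops / 1000:.2f} TFlops")
print(f"Linpack efficiency: {efficiency:.1%}")
```

Under these assumptions the Linpack result is a little under 60% of the theoretical peak.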
Interconnections
– Myrinet 2000
– 10 µs latency (one way)
– 2+2 Gbit/s full-duplex bandwidth
– bisection bandwidth: 128 × (2+2) Gbit/s
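The factor of 128 in the bisection figure can be reproduced as follows. The sketch assumes 256 nodes (512 CPUs, 2 per node) and a full-bisection network, so that cutting the machine in half severs one full-duplex link per node on one side. The function name is made up for illustration.

```python
def bisection_bandwidth_gbit(nodes: int, per_direction_gbit: float) -> float:
    """Full-duplex bandwidth across a cut that splits the nodes in half.

    Assumes a full-bisection network: one link per node on one side of the cut.
    """
    links_across_cut = nodes // 2
    return links_across_cut * 2 * per_direction_gbit  # both directions counted

NODES = 512 // 2                                  # 512 CPUs, 2 per node
total_gbit = bisection_bandwidth_gbit(NODES, 2.0)  # 2 Gbit/s per direction
print(f"{total_gbit:.0f} Gbit/s = {total_gbit / 8:.0f} GByte/s")
```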
Additional equipment
• Myrinet cluster of 32 double nodes for interactive development
• 2 front-end PCs as access, compilation, and job distribution hosts
• 1 administration server
• 1 file server (Sun Fire 880) + 2 TByte RAID 5 disk array
• 10 TByte tape backup
• Remote power control device
Problems
• Hardware errors: 3 power supplies, 3 hard disks, 2 motherboards, 8 Myrinet network cards
• Software: kernel 2.4.18 (stable); 2 node crashes caused by daemon crashes
Clustering
• What is needed?
– Booting concept
• network boot (DHCP)
– Cluster installation
• installation via network
– Power control
• remote access to power supplies, sequential power off/on, reset
– BIOS control
• update and settings via network, direct access via serial link
– Health control of nodes
• fan speed, CPU temperature, and disk status gathered via network
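The health-control item above amounts to collecting a few readings per node and flagging those outside a safe range. The sketch below illustrates the idea only: the `NodeReading` type, the thresholds, and the host names are hypothetical and are not taken from the HELICS setup, and the network collection step is left out.

```python
from dataclasses import dataclass

@dataclass
class NodeReading:
    """One health sample for a node (hypothetical layout)."""
    host: str
    fan_rpm: int
    cpu_temp_c: float
    disk_ok: bool

# Hypothetical thresholds, chosen only for the example.
MIN_FAN_RPM = 2000
MAX_CPU_TEMP_C = 70.0

def find_problems(reading: NodeReading) -> list[str]:
    """Return a list of human-readable problems for one node."""
    problems = []
    if reading.fan_rpm < MIN_FAN_RPM:
        problems.append("fan too slow")
    if reading.cpu_temp_c > MAX_CPU_TEMP_C:
        problems.append("CPU too hot")
    if not reading.disk_ok:
        problems.append("disk failure")
    return problems

def unhealthy_nodes(readings: list[NodeReading]) -> dict[str, list[str]]:
    """Map each unhealthy host to its problems; healthy hosts are omitted."""
    report = {}
    for reading in readings:
        problems = find_problems(reading)
        if problems:
            report[reading.host] = problems
    return report

sample = [
    NodeReading("node001", 4200, 48.5, True),
    NodeReading("node002", 900, 74.0, True),
    NodeReading("node003", 4100, 50.0, False),
]
report = unhealthy_nodes(sample)
for host, problems in report.items():
    print(host, "->", ", ".join(problems))
```

A report like this could feed the remote power control, for example to switch off an overheating node.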
Clustering
• Reliability of resources
– spare hosts, redundant servers
• Availability
• Monitoring & accounting
– gathering system and job status and accounting information via network
• Batching concepts
– SCore cluster software
Clustering
• Application optimization
– tracing + profiling tools (Vampir, Paraver)
• Debugging of parallel applications
– debuggers: TotalView, P2D2, PGI
Software
• SCore Cluster System Software is a high-performance parallel programming environment for workstation and PC clusters
SCORE
• Heterogeneous Programming Language
• Multiple Programming Paradigms
• Parallel Programming Support
– Real-time process activity monitor
– Deadlock detection
– Automatic debugger attachment
SCORE
• Fault tolerance
– Preemptive checkpoint
– Parallel process migration
• Flexible Job Scheduling
– Gang scheduling
– Batch scheduling
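Gang scheduling means that all processes of a parallel job run in the same time slice across their nodes, and the whole machine switches between groups of jobs together. The following is a minimal sketch of that idea using first-fit packing into time slots. It is not SCore's scheduler, and the job names and node counts are invented for the example.

```python
def build_slots(jobs: list[tuple[str, int]], total_nodes: int) -> list[list[tuple[str, int]]]:
    """Pack jobs (name, nodes needed) into time slots using first fit.

    Jobs in the same slot run simultaneously, so their node counts
    must not add up to more than total_nodes.
    """
    slots: list[list[tuple[str, int]]] = []
    for name, nodes in jobs:
        if nodes > total_nodes:
            raise ValueError(f"job {name} needs more nodes than the cluster has")
        for slot in slots:
            if sum(n for _, n in slot) + nodes <= total_nodes:
                slot.append((name, nodes))
                break
        else:
            slots.append([(name, nodes)])  # no existing slot fits: open a new one
    return slots

def timeline(slots: list[list[tuple[str, int]]], ticks: int) -> list[list[str]]:
    """Rotate through the slots round-robin; each tick runs one whole slot."""
    if not slots:
        return [[] for _ in range(ticks)]
    return [[name for name, _ in slots[t % len(slots)]] for t in range(ticks)]

jobs = [("A", 200), ("B", 100), ("C", 56), ("D", 256)]  # hypothetical jobs
slots = build_slots(jobs, total_nodes=256)
for tick, running in enumerate(timeline(slots, ticks=4)):
    print(f"tick {tick}: {running}")
```

Batch scheduling, by contrast, would run each job to completion before starting the next one on the same nodes.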
USAGE
• Reactive flows
• Optimization problems
• Technical simulations
• Image processing
• Bio-computing/Bioinformatics