Architecture of Parallel Computers
CSC / ECE 506
BlueGene Architecture
Lecture 24
7/31/2006
Dr Steve Hunter
BlueGene/L Program

• December 1999: IBM Research announced a 5-year, $100M US effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
– Advance the state of the art of scientific simulation.
– Advance the state of the art in computer design and software for capability and capacity markets.
• November 2001: Announced a research partnership with Lawrence Livermore National Laboratory (LLNL).
• November 2002: Announced planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.
• May 11, 2004: Four racks of DD1 (4096 nodes at 500 MHz) ran Linpack at 11.68 TFlops/s, ranked #4 on the 23rd Top500 list.
• June 2, 2004: Two racks of DD2 (1024 nodes at 700 MHz) ran Linpack at 8.655 TFlops/s, ranked #8 on the 23rd Top500 list.
• September 16, 2004: 8 racks ran Linpack at 36.01 TFlops/s.
• November 8, 2004: 16 racks ran Linpack at 70.72 TFlops/s, ranked #1 on the 24th Top500 list.
• December 21, 2004: First 16 racks of BG/L accepted by LLNL.
BlueGene/L Program
• Massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs.
– A joint development of IBM and DOE’s National Nuclear Security Administration (NNSA) and installed at DOE’s Lawrence Livermore National Laboratory
• BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/)
– It has reached a Linpack benchmark performance of 280.6 TFlop/s (“teraflops” or trillions of calculations per second) and still remains the only system ever to exceed the level of 100 TFlop/s.
– BlueGene systems hold the #1, #2, and #8 positions in the top 10.
• “Objective was to retain exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications” - Overview of BG/L system architecture, IBM JRD
– Design approach was to use a very high level of integration that made simplicity in packaging, design, and bring-up possible
– JRD issue available at: http://www.research.ibm.com/journal/rd49-23.html
BlueGene/L Program
• BlueGene is a family of supercomputers.
– BlueGene/L is the first step, aimed as a multipurpose, massively parallel, and cost-effective supercomputer (12/04).
– BlueGene/P is the petaflop generation (12/06).
– BlueGene/Q is the third generation (~2010).
• Requirements for future generations:
– Processors will be more powerful.
– Networks will be higher bandwidth.
– Applications developed on BlueGene/L will run well on BlueGene/P.
BlueGene/L Fundamentals

• Low-complexity nodes give more flops per transistor and per watt.
• A 3D interconnect supports many scientific simulations, since nature as we see it is 3D.
BlueGene/L Fundamentals
• Cellular architecture
– Large numbers of low-power, more efficient processors interconnected
• Rmax of 280.6 Teraflops
– Maximal LINPACK performance achieved
• Rpeak of 360 Teraflops
– Theoretical peak performance
• 65,536 dual-processor compute nodes
– 700 MHz IBM PowerPC 440 processors
– 512 MB memory per compute node, 16 TB in entire system
– 800 TB of disk space
• 2,500 square feet
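As a cross-check, the Rpeak figure follows from the node counts above, assuming each PowerPC 440 core's double FPU retires two fused multiply-adds (4 flops) per cycle:

$$R_{\text{peak}} = 65{,}536 \text{ nodes} \times 2 \text{ cores} \times 700\,\text{MHz} \times 4\ \tfrac{\text{flops}}{\text{cycle}} \approx 367\ \text{TFlop/s},$$

which the slide above rounds to 360 Teraflops (the Comparing Systems table later lists 367 TF/s).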
Comparing Systems (Peak)

[Figure: "Supercomputer Peak Performance", peak speed (flops) vs. year introduced, 1940-2010, on a log scale from 1E+2 to 1E+17, with a doubling time of 1.5 years. Milestones run from ENIAC (vacuum tubes), UNIVAC, IBM 701/704/7090 (transistors), IBM Stretch, and CDC 6600 (ICs) through CDC 7600, CDC STAR-100 (vectors), CRAY-1, Cyber 205, X-MP2 (parallel vectors), CRAY-2, X-MP4, Y-MP8, and ILLIAC IV; Japanese machines (S-810/20, SX-2 through SX-5, VP2600/10, NWT, CP-PACS, Earth Simulator); and MPPs (i860, Delta, CM-5, Paragon, T3D, T3E, ASCI Red, Blue Pacific, ASCI White, ASCI Q, Red Storm, Thunder) to Blue Gene/L, trending toward petaflop and multi-petaflop systems.]
Comparing Systems (Byte/Flop)

• Red Storm: 2.0 (2003)
• Earth Simulator: 2.0 (2002)
• Intel Paragon: 1.8 (1992)
• nCUBE/2: 1.0 (1990)
• ASCI Red: 1.0 (0.6) (1997)
• T3E: 0.8 (1996)
• BG/L: 1.5 = 0.75 (torus) + 0.75 (tree) (2004)
• Cplant: 0.1 (1997)
• ASCI White: 0.1 (2000)
• ASCI Q: 0.05, Quadrics (2003)
• ASCI Purple: 0.1 (2004)
• Intel Cluster: 0.1, IB (2004)
• Intel Cluster: 0.008, GbE (2003)
• Virginia Tech: 0.16, IB (2003)
• Chinese Acad. of Sci.: 0.04, QsNet (2003)
• NCSA Dell: 0.04, Myrinet (2003)
Comparing Systems (GFlops/Watt)

• Power efficiencies of recent supercomputers (figure from the IBM Journal of Research and Development)
– Blue: IBM machines
– Black: other US machines
– Red: Japanese machines
Comparing Systems

                     Blue Gene/L   Earth Simulator   ASCI Q   ASCI White
Clock (MHz)              700            500           1000       375
# Nodes                65,536           640           4096       512
Cost ($M)                100            400            200       100
Power (MW)*              1.5           6-8.5           3.8        1
Footprint (sq ft)       2,500         34,000         20,000    10,000
Total Mem. (TBytes)       32             10             33        8
Machine Peak (TF/s)      367           40.96            30      12.3

* 10 megawatts is the approximate usage of 11,000 households.
BG/L Summary of Performance Results
• DGEMM (Double-precision GEneral Matrix-Multiply):
– 92.3% of dual-core peak on 1 node
– Observed performance at 500 MHz: 3.7 GFlops
– Projected performance at 700 MHz: 5.2 GFlops (tested in lab up to 650 MHz)
• LINPACK:
– 77% of peak on 1 node
– 70% of peak on 512 nodes (1435 GFlops at 500 MHz)
• sPPM (simplified Piecewise Parabolic Method), UMT2000:
– Single-processor performance roughly on par with POWER3 at 375 MHz
– Tested on up to 128 nodes (also NAS Parallel Benchmarks)
• FFT (Fast Fourier Transform):
– Up to 508 MFlops on a single processor at 444 MHz (TU Vienna)
– Pseudo-ops performance (5N log N) @ 700 MHz of 1300 MFlops (65% of peak)
• STREAM: impressive results even at 444 MHz (the Triad kernel is sketched after this slide's bullets):
– Tuned: Copy: 2.4 GB/s, Scale: 2.1 GB/s, Add: 1.8 GB/s, Triad: 1.9 GB/s
– Standard: Copy: 1.2 GB/s, Scale: 1.1 GB/s, Add: 1.2 GB/s, Triad: 1.2 GB/s
– At 700 MHz: would beat STREAM numbers for most high-end microprocessors
• MPI:
– Latency: < 4000 cycles (5.5 µs at 700 MHz)
– Bandwidth: full link bandwidth demonstrated on up to 6 links
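For reference, the STREAM numbers above come from four simple bandwidth-bound vector loops; a minimal sketch of the Triad kernel (the standard STREAM definition, not IBM's tuned version):

```c
#include <stddef.h>

/* STREAM "Triad" kernel as defined by the standard benchmark:
 * a[i] = b[i] + q * c[i]
 * Reported bandwidth counts 24 bytes per iteration
 * (two 8-byte loads and one 8-byte store). */
void stream_triad(double *a, const double *b, const double *c,
                  double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```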
BlueGene/L Architecture
• To achieve this level of integration, the machine was developed around a processor with moderate frequency, available in system-on-a-chip (SoC) technology
– This approach was chosen because of the performance/power advantage
– In terms of performance/watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10
– Industry focus is on performance / rack (a worked example follows this slide's bullets)
» Performance / rack = Performance / watt * Watt / rack
» Watt / rack = 20 kW for power and thermal cooling reasons
• Power and cooling
– Using conventional techniques, a 360 Tflops machine would require 10-20 megawatts.
– BlueGene/L uses only 1.76 megawatts
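A worked instance of the rack identity above, using numbers from elsewhere in this deck (1024 nodes per rack, from the four-rack/4096-node DD1 timeline entry, and the ~5.6 GFlops peak per dual-core node derived on the Fundamentals slide):

$$\frac{\text{Perf}}{\text{rack}} \approx 1024 \times 5.6\ \text{GFlop/s} \approx 5.7\ \text{TFlop/s}, \qquad \frac{\text{Perf}}{\text{watt}} \approx \frac{5.7\ \text{TFlop/s}}{20\ \text{kW}} \approx 0.29\ \text{GFlop/s per watt}$$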
Microprocessor Power Density Growth
System Power Comparison
• BG/L, 2048 processors: 20.1 kW
• 450 ThinkPads: 20.3 kW (LS Mok, 4/2002)
BlueGene/L Architecture
• Networks were chosen with extreme scaling in mind
– Scale efficiently in terms of both performance and packaging
– Support very small messages
» As small as 32 bytes
– Includes hardware support for collective operations
» Broadcast, reduction, scan, etc.
• Reliability, Availability and Serviceability (RAS) is another critical issue for scaling
– BG/L needs to be reliable and usable even at extreme scaling limits
– 20 failures per 1,000,000,000 node-hours = 1 node failure every 4.5 weeks (worked out after this list)
• System Software and Monitoring are also important to scaling
– BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model
– MPI is the dominant message-passing model; hardware features were added and parameters tuned for it
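The 4.5-week figure above follows from applying the per-node failure rate across all 65,536 nodes:

$$\text{MTBF}_{\text{system}} = \frac{10^{9}\ \text{hours}}{20 \times 65{,}536} \approx 763\ \text{hours} \approx 4.5\ \text{weeks}$$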
RAS (Reliability, Availability, Serviceability)

• System designed for RAS from top to bottom
– System issues
» Redundant bulk supplies, power converters, fans, DRAM bits, cable bits
» Extensive data logging (voltage, temp, recoverable errors, ...) for failure forecasting
» Nearly no single points of failure
– Chip design
» ECC on all SRAMs
» All dataflow outside the processors is protected by error-detection mechanisms
» Access to all state via a noninvasive back door
– Low-power, simple design leads to higher reliability
– All interconnects have multiple error detection and correction coverage
» Virtually zero escape probability for link errors
BlueGene/L System

136.8 Teraflop/s on LINPACK (64K processors); 1 TF = 1,000,000,000,000 Flops. (Rochester Lab, 2005)
Physical Layout of BG/L
Midplanes and Racks
The Compute Chip
• System-on-a-chip (SoC): 1 ASIC containing
– 2 PowerPC processors
– L1 and L2 caches
– 4 MB embedded DRAM
– DDR DRAM interface and DMA controller
– Network connectivity hardware
– Control / monitoring equipment (JTAG)
Compute Card
Node Card
BlueGene/L Compute ASIC
[Block diagram: two PPC440 cores (one serving as I/O processor), each with 32k/32k L1 caches and a "Double FPU", attach via a PLB (4:1) and private L2 caches (256-bit paths, with snoop) to a multiported shared SRAM buffer and a shared L3 directory for EDRAM (includes ECC), backed by 4 MB of EDRAM usable as L3 cache or memory (1024+144-bit ECC path). A 144-bit-wide DDR controller with ECC drives 256/512 MB of external DDR. On-chip network interfaces: Torus (6 out and 6 in, each a 1.4 Gbit/s link), Tree (3 out and 3 in, each a 2.8 Gbit/s link), Global Interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG access.]

• IBM CU-11, 0.13 µm
• 11 x 11 mm die size
• 25 x 32 mm CBGA
• 474 pins, 328 signal
• 1.5/2.5 Volt
BlueGene/L Interconnect Networks

• 3-Dimensional Torus
– Main network, for point-to-point communication
– High-speed, high-bandwidth
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– 1 µs latency between nearest neighbors, 5 µs to the farthest
– 4 µs latency for one hop with MPI, 10 µs to the farthest
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
• Global Tree
– One-to-all broadcast functionality
– Reduction operations functionality
– MPI collective ops in hardware
– Fixed-size 256-byte packets
– 2.8 Gb/s of bandwidth per link
– Latency of one-way tree traversal 2.5 µs
– ~23 TB/s total binary-tree bandwidth (64k machine)
– Interconnects all compute and I/O nodes (1024)
– Also guarantees reliable delivery
• Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)
• Low-Latency Global Barrier and Interrupt
– Latency of round trip 1.3 µs
• Control Network
The Torus Network
• 3-dimensional: 64 x 32 x 32
– Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z-
– Compute card is 1x2x1
– Node card is 4x4x2 (16 compute cards in a 4x2x2 arrangement)
– Midplane is 8x8x8 (16 node cards in a 2x2x4 arrangement)
• Communication path
– Each unidirectional link is 1.4 Gb/s, or 175 MB/s
– Each node can send and receive at 1.05 GB/s
– Supports cut-through routing, along with both deterministic and adaptive routing
– Variable-sized packets of 32, 64, 96, ..., 256 bytes
– Guarantees reliable delivery
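A minimal sketch (illustrative C, not BG/L system code) of the wraparound neighbor rule implied by the bullets above: a node's six neighbors are found by stepping ±1 in each dimension modulo the torus size.

```c
#include <stdio.h>

/* Full-system torus dimensions from the slide. */
enum { NX = 64, NY = 32, NZ = 32 };

/* Fill nbr[6][3] with the (x,y,z) coordinates of a node's six
 * torus neighbors: x+, x-, y+, y-, z+, z-, with wraparound. */
static void torus_neighbors(int x, int y, int z, int nbr[6][3])
{
    static const int step[6][3] = {
        {+1, 0, 0}, {-1, 0, 0},
        { 0,+1, 0}, { 0,-1, 0},
        { 0, 0,+1}, { 0, 0,-1},
    };
    for (int d = 0; d < 6; d++) {
        nbr[d][0] = (x + step[d][0] + NX) % NX;
        nbr[d][1] = (y + step[d][1] + NY) % NY;
        nbr[d][2] = (z + step[d][2] + NZ) % NZ;
    }
}

int main(void)
{
    int nbr[6][3];
    torus_neighbors(0, 0, 0, nbr);  /* corner node: x- wraps to 63, etc. */
    for (int d = 0; d < 6; d++)
        printf("(%2d,%2d,%2d)\n", nbr[d][0], nbr[d][1], nbr[d][2]);
    return 0;
}
```

The wraparound links bound the worst-case distance: with torus wrap, the farthest node is 32 + 16 + 16 = 64 hops away.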
Complete BlueGene/L System at LLNL
[Figure: 65,536 BG/L compute nodes and 1,024 BG/L I/O nodes connect through a 2,048-port federated Gigabit Ethernet switch (1024 links on the BG/L side) to front-end nodes, WAN, visualization, archive, and CWFS; a service node manages the machine over a separate control network.]
System Software Overview
• Operating system - Linux
• Compilers - IBM XL C, C++, Fortran95
• Communication - MPI, TCP/IP
• Parallel File System - GPFS, NFS support
• System Management - extensions to CSM
• Job scheduling - based on LoadLeveler
• Math libraries - ESSL
BG/L Software Hierarchical Organization
• Compute nodes dedicated to running user application, and almost nothing else - simple compute node kernel (CNK)
• I/O nodes run Linux and provide a more complete range of OS services – files, sockets, process launch, signaling, debugging, and termination
• Service node performs system management services (e.g., heartbeat monitoring, error monitoring), transparent to application software
BG/L System Software
• Simplicity
– Space-sharing
– Single-threaded
– No demand paging
• Familiarity
– MPI (MPICH2)
– IBM XL Compilers for PowerPC
Operating Systems
• Front-end nodes are commodity systems running Linux
• I/O nodes run a customized Linux kernel
• Compute nodes use an extremely lightweight custom kernel
• Service node is a single multiprocessor machine running a custom OS
Compute Node Kernel (CNK)
• Single user, dual-threaded
• Flat address space, no paging
• Physical resources are memory-mapped
• Provides standard POSIX functionality (mostly)
• Two execution modes:
– Virtual node mode (both cores run application processes)
– Coprocessor mode (one core computes, the other handles communication)
Service Node OS
• Core Management and Control System (CMCS): BG/L's "global" operating system
• MMCS - Midplane Monitoring and Control System
• CIOMAN - Control and I/O Manager
• DB2 relational database
Running a User Job
• Compiled and submitted from a front-end node.
• External scheduler
• Service node sets up partition, and transfers user’s code to compute nodes.
• All file I/O is done using standard Unix calls (via the I/O nodes).
• Post-facto debugging done on front-end nodes.
Performance Issues
• User code is easily ported to BG/L.
• However, the MPI implementation requires effort and skill (see the sketch below)
– Torus topology instead of a crossbar
– Special hardware, such as the collective (tree) network
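For example, an application can describe its communication pattern as a periodic 3D grid so the MPI library can map ranks onto the physical torus. A hedged sketch using standard MPI topology calls (BG/L-specific mapping controls are not shown):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Let MPI factor the job into a 3D grid; all dimensions periodic,
     * matching the torus. reorder=1 permits the implementation to
     * renumber ranks to fit the physical network. */
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Dims_create(nprocs, 3, dims);

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);

    /* Ranks of the x- and x+ neighbors, for nearest-neighbor exchange. */
    int xminus, xplus;
    MPI_Cart_shift(cart, 0, 1, &xminus, &xplus);
    printf("rank %d: x-neighbors %d and %d\n", rank, xminus, xplus);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```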
BG/L MPI Software Architecture
GI = Global Interrupt
CIO = Control and I/O Protocol
CH3 = primary communication device distributed with MPICH2
MPD = Multipurpose Daemon
MPI_Bcast
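MPI_Bcast is the one-to-all collective that BG/L can map onto the tree network's hardware broadcast (per the interconnect slides). A minimal usage sketch:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double params[3] = {0.0, 0.0, 0.0};
    if (rank == 0) {            /* only the root has the data initially */
        params[0] = 1.5; params[1] = 2.5; params[2] = 3.5;
    }

    /* One-to-all broadcast from rank 0; on BG/L this collective is
     * a candidate for the tree network's hardware broadcast. */
    MPI_Bcast(params, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d now has %.1f %.1f %.1f\n",
           rank, params[0], params[1], params[2]);
    MPI_Finalize();
    return 0;
}
```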
MPI_Alltoall
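MPI_Alltoall is the personalized all-to-all exchange; on a torus it is bounded by bisection bandwidth (the 0.7/1.4 TB/s figure on the torus slide). A minimal usage sketch:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank sends one distinct int to every rank (including itself). */
    int *sendbuf = malloc(nprocs * sizeof *sendbuf);
    int *recvbuf = malloc(nprocs * sizeof *recvbuf);
    for (int i = 0; i < nprocs; i++)
        sendbuf[i] = rank * nprocs + i;   /* value destined for rank i */

    /* After the call, recvbuf[j] holds the value rank j sent to us. */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received %d from rank 0\n", rank, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```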
References
• IBM Journal of Research and Development, Vol. 49, No. 2/3.
– http://www.research.ibm.com/journal/rd49-23.html
» "Overview of the Blue Gene/L system architecture"
» "Packaging the Blue Gene/L supercomputer"
» "Blue Gene/L compute chip: Memory and Ethernet subsystems"
» "Blue Gene/L torus interconnection network"
» "Blue Gene/L programming and operating environment"
» "Design and implementation of message-passing services for the Blue Gene/L supercomputer"
References (cont.)
• BG/L homepage @ LLNL: <http://www.llnl.gov/ASC/platforms/bluegenel/>
• BlueGene homepage @ IBM: <http://www.research.ibm.com/bluegene/>
The End