Architecture of Parallel Computers
CSC / ECE 506
BlueGene Architecture
Lecture 24
7/31/2006
Dr Steve Hunter
BlueGene/L Program

• December 1999: IBM Research announced a 5-year, $100M US effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
– Advance the state of the art of scientific simulation.
– Advance the state of the art in computer design and software for capability and capacity markets.
• November 2001: Announced a research partnership with Lawrence Livermore National Laboratory (LLNL).
• November 2002: Announced planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.
• May 11, 2004: Four racks of DD1 (4096 nodes at 500 MHz) ran Linpack at 11.68 TFlops/s, ranked #4 on the 23rd Top500 list.
• June 2, 2004: Two racks of DD2 (1024 nodes at 700 MHz) ran Linpack at 8.655 TFlops/s, ranked #8 on the 23rd Top500 list.
• September 16, 2004: 8 racks ran Linpack at 36.01 TFlops/s.
• November 8, 2004: 16 racks ran Linpack at 70.72 TFlops/s, ranked #1 on the 24th Top500 list.
• December 21, 2004: First 16 racks of BG/L accepted by LLNL.
BlueGene/L Program
• Massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs.
– A joint development of IBM and DOE’s National Nuclear Security Administration (NNSA) and installed at DOE’s Lawrence Livermore National Laboratory
• BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/)
– It has reached a Linpack benchmark performance of 280.6 TFlop/s (“teraflops” or trillions of calculations per second) and still remains the only system ever to exceed the level of 100 TFlop/s.
– BlueGene systems hold the #1, #2, and #8 positions in the top 10.
• “Objective was to retain exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications” - Overview of BG/L system architecture, IBM JRD
– Design approach was to use a very high level of integration that made simplicity in packaging, design, and bring-up possible
– JRD issue available at: http://www.research.ibm.com/journal/rd49-23.html
BlueGene/L Program
• BlueGene is a family of supercomputers.
– BlueGene/L is the first step, aimed as a multipurpose, massively parallel, and cost-effective supercomputer (12/04).
– BlueGene/P is the petaflop generation (12/06).
– BlueGene/Q is the third generation (~2010).
• Requirements for future generations:
– Processors will be more powerful.
– Networks will be higher bandwidth.
– Applications developed on BlueGene/L will run well on BlueGene/P.
BlueGene/L Fundamentals

• Low-complexity nodes give more flops per transistor and per watt.
• A 3D interconnect supports many scientific simulations, since nature as we see it is 3D.
BlueGene/L Fundamentals
• Cellular architecture
– Large numbers of low-power, more efficient processors interconnected
• Rmax of 280.6 Teraflops
– Maximal LINPACK performance achieved
• Rpeak of 360 Teraflops
– Theoretical peak performance
• 65,536 dual-processor compute nodes
– 700 MHz IBM PowerPC 440 processors
– 512 MB memory per compute node, 16 TB in entire system
– 800 TB of disk space
• 2,500 square feet
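As a cross-check, the Rpeak figure follows from the node counts above, assuming each PowerPC 440 core's double FPU retires two fused multiply-adds (4 flops) per cycle:

$$R_{\text{peak}} = 65{,}536 \text{ nodes} \times 2 \text{ cores} \times 700\,\text{MHz} \times 4\ \tfrac{\text{flops}}{\text{cycle}} \approx 367\ \text{TFlop/s},$$

which the slide above rounds to 360 Teraflops (the Comparing Systems table later lists 367 TF/s).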
Comparing Systems (Peak)

[Figure: "Supercomputer Peak Performance", peak speed (flops) vs. year introduced, 1940-2010, on a log scale from 1E+2 to 1E+17, with a doubling time of 1.5 years. Milestones run from ENIAC (vacuum tubes), UNIVAC, IBM 701/704/7090 (transistors), IBM Stretch, and CDC 6600 (ICs) through CDC 7600, CDC STAR-100 (vectors), CRAY-1, Cyber 205, X-MP2 (parallel vectors), CRAY-2, X-MP4, Y-MP8, and ILLIAC IV; Japanese machines (S-810/20, SX-2 through SX-5, VP2600/10, NWT, CP-PACS, Earth Simulator); and MPPs (i860, Delta, CM-5, Paragon, T3D, T3E, ASCI Red, Blue Pacific, ASCI White, ASCI Q, Red Storm, Thunder) to Blue Gene/L, trending toward petaflop and multi-petaflop systems.]
Comparing Systems (Byte/Flop)

• Red Storm: 2.0 (2003)
• Earth Simulator: 2.0 (2002)
• Intel Paragon: 1.8 (1992)
• nCUBE/2: 1.0 (1990)
• ASCI Red: 1.0 (0.6) (1997)
• T3E: 0.8 (1996)
• BG/L: 1.5 = 0.75 (torus) + 0.75 (tree) (2004)
• Cplant: 0.1 (1997)
• ASCI White: 0.1 (2000)
• ASCI Q: 0.05, Quadrics (2003)
• ASCI Purple: 0.1 (2004)
• Intel Cluster: 0.1, IB (2004)
• Intel Cluster: 0.008, GbE (2003)
• Virginia Tech: 0.16, IB (2003)
• Chinese Acad. of Sci.: 0.04, QsNet (2003)
• NCSA Dell: 0.04, Myrinet (2003)
Comparing Systems (GFlops/Watt)

• Power efficiencies of recent supercomputers (figure from the IBM Journal of Research and Development)
– Blue: IBM machines
– Black: other US machines
– Red: Japanese machines
Comparing Systems

                     Blue Gene/L   Earth Simulator   ASCI Q   ASCI White
Clock (MHz)              700            500           1000       375
# Nodes                65,536           640           4096       512
Cost ($M)                100            400            200       100
Power (MW)*              1.5           6-8.5           3.8        1
Footprint (sq ft)       2,500         34,000         20,000    10,000
Total Mem. (TBytes)       32             10             33        8
Machine Peak (TF/s)      367           40.96            30      12.3

* 10 megawatts is the approximate usage of 11,000 households.
BG/L Summary of Performance Results
• DGEMM (Double-precision GEneral Matrix-Multiply):
– 92.3% of dual-core peak on 1 node
– Observed performance at 500 MHz: 3.7 GFlops
– Projected performance at 700 MHz: 5.2 GFlops (tested in lab up to 650 MHz)
• LINPACK:
– 77% of peak on 1 node
– 70% of peak on 512 nodes (1435 GFlops at 500 MHz)
• sPPM (simplified Piecewise Parabolic Method), UMT2000:
– Single-processor performance roughly on par with POWER3 at 375 MHz
– Tested on up to 128 nodes (also NAS Parallel Benchmarks)
• FFT (Fast Fourier Transform):
– Up to 508 MFlops on a single processor at 444 MHz (TU Vienna)
– Pseudo-ops performance (5N log N) @ 700 MHz of 1300 MFlops (65% of peak)
• STREAM: impressive results even at 444 MHz (the Triad kernel is sketched after this slide's bullets):
– Tuned: Copy: 2.4 GB/s, Scale: 2.1 GB/s, Add: 1.8 GB/s, Triad: 1.9 GB/s
– Standard: Copy: 1.2 GB/s, Scale: 1.1 GB/s, Add: 1.2 GB/s, Triad: 1.2 GB/s
– At 700 MHz: would beat STREAM numbers for most high-end microprocessors
• MPI:
– Latency: < 4000 cycles (5.5 µs at 700 MHz)
– Bandwidth: full link bandwidth demonstrated on up to 6 links
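For reference, the STREAM numbers above come from four simple bandwidth-bound vector loops; a minimal sketch of the Triad kernel (the standard STREAM definition, not IBM's tuned version):

```c
#include <stddef.h>

/* STREAM "Triad" kernel as defined by the standard benchmark:
 * a[i] = b[i] + q * c[i]
 * Reported bandwidth counts 24 bytes per iteration
 * (two 8-byte loads and one 8-byte store). */
void stream_triad(double *a, const double *b, const double *c,
                  double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```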
BlueGene/L Architecture
• To achieve this level of integration, the machine was developed around a processor with moderate frequency, available in system-on-a-chip (SoC) technology
– This approach was chosen because of the performance/power advantage
– In terms of performance/watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10
– Industry focus is on performance / rack (a worked example follows this slide's bullets)
» Performance / rack = Performance / watt * Watt / rack
» Watt / rack = 20 kW for power and thermal cooling reasons
• Power and cooling
– Using conventional techniques, a 360 Tflops machine would require 10-20 megawatts.
– BlueGene/L uses only 1.76 megawatts
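A worked instance of the rack identity above, using numbers from elsewhere in this deck (1024 nodes per rack, from the four-rack/4096-node DD1 timeline entry, and the ~5.6 GFlops peak per dual-core node derived on the Fundamentals slide):

$$\frac{\text{Perf}}{\text{rack}} \approx 1024 \times 5.6\ \text{GFlop/s} \approx 5.7\ \text{TFlop/s}, \qquad \frac{\text{Perf}}{\text{watt}} \approx \frac{5.7\ \text{TFlop/s}}{20\ \text{kW}} \approx 0.29\ \text{GFlop/s per watt}$$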
Microprocessor Power Density Growth
System Power Comparison
• BG/L, 2048 processors: 20.1 kW
• 450 ThinkPads: 20.3 kW (LS Mok, 4/2002)
BlueGene/L Architecture
• Networks were chosen with extreme scaling in mind
– Scale efficiently in terms of both performance and packaging
– Support very small messages
» As small as 32 bytes
– Includes hardware support for collective operations
» Broadcast, reduction, scan, etc.
• Reliability, Availability and Serviceability (RAS) is another critical issue for scaling
– BG/L needs to be reliable and usable even at extreme scaling limits
– 20 failures per 1,000,000,000 node-hours = 1 node failure every 4.5 weeks (worked out after this list)
• System Software and Monitoring are also important to scaling
– BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model
– MPI is the dominant message-passing model; hardware features were added and parameters tuned for it
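The 4.5-week figure above follows from applying the per-node failure rate across all 65,536 nodes:

$$\text{MTBF}_{\text{system}} = \frac{10^{9}\ \text{hours}}{20 \times 65{,}536} \approx 763\ \text{hours} \approx 4.5\ \text{weeks}$$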
RAS (Reliability, Availability, Serviceability)

• System designed for RAS from top to bottom
– System issues
» Redundant bulk supplies, power converters, fans, DRAM bits, cable bits
» Extensive data logging (voltage, temp, recoverable errors, ...) for failure forecasting
» Nearly no single points of failure
– Chip design
» ECC on all SRAMs
» All dataflow outside the processors is protected by error-detection mechanisms
» Access to all state via a noninvasive back door
– Low-power, simple design leads to higher reliability
– All interconnects have multiple error detection and correction coverage
» Virtually zero escape probability for link errors
BlueGene/L System

136.8 Teraflop/s on LINPACK (64K processors); 1 TF = 1,000,000,000,000 Flops. (Rochester Lab, 2005)
Physical Layout of BG/L
Midplanes and Racks
The Compute Chip
• System-on-a-chip (SoC): 1 ASIC containing
– 2 PowerPC processors
– L1 and L2 caches
– 4 MB embedded DRAM
– DDR DRAM interface and DMA controller
– Network connectivity hardware
– Control / monitoring equipment (JTAG)
Compute Card
Node Card
BlueGene/L Compute ASIC
[Block diagram: two PPC440 cores (one serving as I/O processor), each with 32k/32k L1 caches and a "Double FPU", attach via a PLB (4:1) and private L2 caches (256-bit paths, with snoop) to a multiported shared SRAM buffer and a shared L3 directory for EDRAM (includes ECC), backed by 4 MB of EDRAM usable as L3 cache or memory (1024+144-bit ECC path). A 144-bit-wide DDR controller with ECC drives 256/512 MB of external DDR. On-chip network interfaces: Torus (6 out and 6 in, each a 1.4 Gbit/s link), Tree (3 out and 3 in, each a 2.8 Gbit/s link), Global Interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG access.]

• IBM CU-11, 0.13 µm
• 11 x 11 mm die size
• 25 x 32 mm CBGA
• 474 pins, 328 signal
• 1.5/2.5 Volt
BlueGene/L Interconnect Networks

• 3-Dimensional Torus
– Main network, for point-to-point communication
– High-speed, high-bandwidth
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– 1 µs latency between nearest neighbors, 5 µs to the farthest
– 4 µs latency for one hop with MPI, 10 µs to the farthest
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
• Global Tree
– One-to-all broadcast functionality
– Reduction operations functionality
– MPI collective ops in hardware
– Fixed-size 256-byte packets
– 2.8 Gb/s of bandwidth per link
– Latency of one-way tree traversal 2.5 µs
– ~23 TB/s total binary-tree bandwidth (64k machine)
– Interconnects all compute and I/O nodes (1024)
– Also guarantees reliable delivery
• Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)
• Low-Latency Global Barrier and Interrupt
– Latency of round trip 1.3 µs
• Control Network
The Torus Network
• 3-dimensional: 64 x 32 x 32
– Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z-
– Compute card is 1x2x1
– Node card is 4x4x2 (16 compute cards in a 4x2x2 arrangement)
– Midplane is 8x8x8 (16 node cards in a 2x2x4 arrangement)
• Communication path
– Each unidirectional link is 1.4 Gb/s, or 175 MB/s
– Each node can send and receive at 1.05 GB/s
– Supports cut-through routing, along with both deterministic and adaptive routing
– Variable-sized packets of 32, 64, 96, ..., 256 bytes
– Guarantees reliable delivery
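A minimal sketch (illustrative C, not BG/L system code) of the wraparound neighbor rule implied by the bullets above: a node's six neighbors are found by stepping ±1 in each dimension modulo the torus size.

```c
#include <stdio.h>

/* Full-system torus dimensions from the slide. */
enum { NX = 64, NY = 32, NZ = 32 };

/* Fill nbr[6][3] with the (x,y,z) coordinates of a node's six
 * torus neighbors: x+, x-, y+, y-, z+, z-, with wraparound. */
static void torus_neighbors(int x, int y, int z, int nbr[6][3])
{
    static const int step[6][3] = {
        {+1, 0, 0}, {-1, 0, 0},
        { 0,+1, 0}, { 0,-1, 0},
        { 0, 0,+1}, { 0, 0,-1},
    };
    for (int d = 0; d < 6; d++) {
        nbr[d][0] = (x + step[d][0] + NX) % NX;
        nbr[d][1] = (y + step[d][1] + NY) % NY;
        nbr[d][2] = (z + step[d][2] + NZ) % NZ;
    }
}

int main(void)
{
    int nbr[6][3];
    torus_neighbors(0, 0, 0, nbr);  /* corner node: x- wraps to 63, etc. */
    for (int d = 0; d < 6; d++)
        printf("(%2d,%2d,%2d)\n", nbr[d][0], nbr[d][1], nbr[d][2]);
    return 0;
}
```

The wraparound links bound the worst-case distance: with torus wrap, the farthest node is 32 + 16 + 16 = 64 hops away.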
Complete BlueGene/L System at LLNL
[Figure: 65,536 BG/L compute nodes and 1,024 BG/L I/O nodes connect through a 2,048-port federated Gigabit Ethernet switch (1024 links on the BG/L side) to front-end nodes, WAN, visualization, archive, and CWFS; a service node manages the machine over a separate control network.]
System Software Overview
• Operating system - Linux
• Compilers - IBM XL C, C++, Fortran95
• Communication - MPI, TCP/IP
• Parallel File System - GPFS, NFS support
• System Management - extensions to CSM
• Job scheduling - based on LoadLeveler
• Math libraries - ESSL
BG/L Software Hierarchical Organization
• Compute nodes dedicated to running user application, and almost nothing else - simple compute node kernel (CNK)
• I/O nodes run Linux and provide a more complete range of OS services – files, sockets, process launch, signaling, debugging, and termination
• Service node performs system management services (e.g., heartbeat monitoring, error monitoring), transparent to application software
BG/L System Software
• Simplicity
– Space-sharing
– Single-threaded
– No demand paging
• Familiarity
– MPI (MPICH2)
– IBM XL Compilers for PowerPC
Operating Systems
• Front-end nodes are commodity systems running Linux
• I/O nodes run a customized Linux kernel
• Compute nodes use an extremely lightweight custom kernel
• Service node is a single multiprocessor machine running a custom OS
Compute Node Kernel (CNK)
• Single user, dual-threaded
• Flat address space, no paging
• Physical resources are memory-mapped
• Provides standard POSIX functionality (mostly)
• Two execution modes:
– Virtual node mode (both cores run application processes)
– Coprocessor mode (one core computes, the other handles communication)
Service Node OS
• Core Management and Control System (CMCS): BG/L's "global" operating system
• MMCS - Midplane Monitoring and Control System
• CIOMAN - Control and I/O Manager
• DB2 relational database
Running a User Job
• Compiled and submitted from a front-end node.
• External scheduler
• Service node sets up partition, and transfers user’s code to compute nodes.
• All file I/O is done using standard Unix calls (via the I/O nodes).
• Post-facto debugging done on front-end nodes.
Performance Issues
• User code is easily ported to BG/L.
• However, the MPI implementation requires effort and skill (see the sketch below)
– Torus topology instead of a crossbar
– Special hardware, such as the collective (tree) network
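For example, an application can describe its communication pattern as a periodic 3D grid so the MPI library can map ranks onto the physical torus. A hedged sketch using standard MPI topology calls (BG/L-specific mapping controls are not shown):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Let MPI factor the job into a 3D grid; all dimensions periodic,
     * matching the torus. reorder=1 permits the implementation to
     * renumber ranks to fit the physical network. */
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Dims_create(nprocs, 3, dims);

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);

    /* Ranks of the x- and x+ neighbors, for nearest-neighbor exchange. */
    int xminus, xplus;
    MPI_Cart_shift(cart, 0, 1, &xminus, &xplus);
    printf("rank %d: x-neighbors %d and %d\n", rank, xminus, xplus);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```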
BG/L MPI Software Architecture
GI = Global Interrupt
CIO = Control and I/O Protocol
CH3 = primary communication device distributed with MPICH2
MPD = Multipurpose Daemon
MPI_Bcast
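MPI_Bcast is the one-to-all collective that BG/L can map onto the tree network's hardware broadcast (per the interconnect slides). A minimal usage sketch:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double params[3] = {0.0, 0.0, 0.0};
    if (rank == 0) {            /* only the root has the data initially */
        params[0] = 1.5; params[1] = 2.5; params[2] = 3.5;
    }

    /* One-to-all broadcast from rank 0; on BG/L this collective is
     * a candidate for the tree network's hardware broadcast. */
    MPI_Bcast(params, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d now has %.1f %.1f %.1f\n",
           rank, params[0], params[1], params[2]);
    MPI_Finalize();
    return 0;
}
```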
MPI_Alltoall
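MPI_Alltoall is the personalized all-to-all exchange; on a torus it is bounded by bisection bandwidth (the 0.7/1.4 TB/s figure on the torus slide). A minimal usage sketch:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank sends one distinct int to every rank (including itself). */
    int *sendbuf = malloc(nprocs * sizeof *sendbuf);
    int *recvbuf = malloc(nprocs * sizeof *recvbuf);
    for (int i = 0; i < nprocs; i++)
        sendbuf[i] = rank * nprocs + i;   /* value destined for rank i */

    /* After the call, recvbuf[j] holds the value rank j sent to us. */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received %d from rank 0\n", rank, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```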
References
• IBM Journal of Research and Development, Vol. 49, No. 2/3.
– http://www.research.ibm.com/journal/rd49-23.html
» "Overview of the Blue Gene/L system architecture"
» "Packaging the Blue Gene/L supercomputer"
» "Blue Gene/L compute chip: Memory and Ethernet subsystems"
» "Blue Gene/L torus interconnection network"
» "Blue Gene/L programming and operating environment"
» "Design and implementation of message-passing services for the Blue Gene/L supercomputer"
References (cont.)
• BG/L homepage @ LLNL: <http://www.llnl.gov/ASC/platforms/bluegenel/>
• BlueGene homepage @ IBM: <http://www.research.ibm.com/bluegene/>
The End