The Blue Gene/L Supercomputer: Architecture and Implementation

David Gregg, Jake Johnson

Papers

“Overview of the BlueGene/L Supercomputer”

© 2002 IBM and Lawrence Livermore National Laboratory

“Unlocking the Performance of the BlueGene/L Supercomputer”

© 2004 IBM and Lawrence Livermore National Laboratory

Topics

• History
• Philosophy
• System Overview
• Architecture
• Networks
• Comparison
• OS
• Limitations
• Conclusion

Context

• Earth Simulator (March 11, 2002)
  ➢ 35.86 teraflops
  ➢ Fastest computer before Blue Gene/L
• Blue Gene/L (September 29, 2004)
  ➢ IBM and Lawrence Livermore National Laboratory
  ➢ First in a line of computers that would eventually pass the petaflop mark
  ➢ 135.5 teraflops (and they weren’t even finished yet)

Purpose of Blue Gene/L

• Primarily performs simulations in the life sciences
• Protein folding
• Will likely be used in the search for cures to diseases:
  ➢ Alzheimer’s
  ➢ Cystic fibrosis
  ➢ Mad cow disease

Philosophy

• Obvious goal:
  ➢ Create a supercomputer that runs as fast as possible
• Typical approach:
  ➢ Take a bunch of really fast nodes
  ➢ Group them together
  ➢ Give them all a lot of computation responsibility

Philosophy

• Limitations of the typical approach:
  ➢ Large, fast SMPs consume increasingly large amounts of electricity
  ➢ Adding more processors delivers additional processing power at a decreasing rate

Philosophy

• The Blue Gene/L approach:
  ➢ Completely different
  ➢ Use a “very large” number of nodes
    ▪ 65,536 to be exact
  ➢ Each node has a modest clock rate
    ▪ About 700 MHz
    ▪ Low power consumption
  ➢ Each node is given a very specific task

Philosophy

• Other design features
  ➢ IBM PowerPC embedded CMOS processors
  ➢ Embedded DRAM
  ➢ System-on-chip techniques
  ➢ Dual-processor design (more on that below)
• Dual processor
  ➢ Compute node
    ▪ Handles computation
  ➢ I/O node
    ▪ Handles communication

Philosophy

• Why dual-processor?
  ➢ The I/O nodes provide the physical interface to the file system and handle various other processes that would be burdensome for the compute nodes
  ➢ This allows the compute node software to be kept simple
  ➢ In keeping with the philosophy…

Peak Performance

• The Blue Gene team estimates that the BG/L’s peak performance will be about 360 teraflops

• Applications that do not take advantage of the second processor should expect a peak performance of 180 teraflops
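
The two peak figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below takes the node count and clock rate from these slides and assumes each processor can complete 4 floating-point operations per cycle (two fused multiply-adds through the double FPU). That per-cycle figure is an assumption, not something stated on the slides, and the function name is purely illustrative.

```python
# Rough peak-performance estimate for Blue Gene/L.
NODES = 65_536            # from the slides
CLOCK_HZ = 700e6          # "about 700 MHz", from the slides
FLOPS_PER_CYCLE = 4       # ASSUMPTION: two fused multiply-adds per cycle per processor


def peak_teraflops(processors_used_per_node):
    """Return the theoretical peak in teraflops for 1 or 2 processors per node."""
    flops = NODES * processors_used_per_node * CLOCK_HZ * FLOPS_PER_CYCLE
    return flops / 1e12


both_processors = peak_teraflops(2)
one_processor = peak_teraflops(1)
print(f"both processors: {both_processors:.1f} TF, one processor: {one_processor:.1f} TF")
```

Under these assumptions the estimate lands slightly above the rounded "about 360" and "180" teraflops quoted above, and using only one processor per node halves the peak.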

System Overview

• Each node
  ➢ Single application-specific integrated circuit (ASIC)
  ➢ 2 GB local memory
• 2 nodes / compute card
• 16 compute cards / node board
• 16 node boards / midplane
• 2 midplanes / 1024-node rack
• 64 racks
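
The packaging hierarchy multiplies out to the 65,536-node total quoted earlier. A minimal sketch of that arithmetic, using only the counts listed on this slide (the level labels and function name are illustrative):

```python
from itertools import accumulate
from operator import mul

# (packaging level, how many of the previous level it contains), from the slide
LEVELS = [
    ("compute card", 2),   # 2 nodes per compute card
    ("node board", 16),    # 16 compute cards per node board
    ("midplane", 16),      # 16 node boards per midplane
    ("rack", 2),           # 2 midplanes per rack
    ("full system", 64),   # 64 racks
]


def nodes_per_level(levels):
    """Return a dict mapping each packaging level to the number of nodes it holds."""
    totals = accumulate((count for _, count in levels), mul)
    return {name: total for (name, _), total in zip(levels, totals)}


node_counts = nodes_per_level(LEVELS)
for name, total in node_counts.items():
    print(f"{name}: {total} nodes")
```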

System Overview

• “Link” ASIC
  ➢ Sits between the midplanes
  ➢ Serves two purposes
    ▪ 1) Re-drives (and therefore strengthens) the signal between midplanes
    ▪ 2) Allows the signals to be redirected between different ports

Architecture

• Each node has 2 PowerPC 440 processors running at 700 MHz
• 2 different execution modes in which the processors interact
  ➢ Communication coprocessor mode (default)
    ▪ One processor -> communication
    ▪ One processor -> general processing
  ➢ Virtual node mode

Virtual Node Mode

• Processors act independently
• Each processor gets half of the memory and a separate MPI task
  ➢ Tasks share use of the network and memory
  ➢ A special region of non-cached shared memory allows communication within the same node
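
To make the trade-off between the two execution modes concrete, here is a small sketch of how tasks and memory divide up. It assumes one MPI task per node in the default mode and two in virtual mode, and uses the 2 GB per-node memory figure quoted on the System Overview slide. The class and constant names are hypothetical.

```python
from dataclasses import dataclass

NODES = 65_536          # from the slides
NODE_MEMORY_GB = 2.0    # per-node figure quoted on the slides


@dataclass(frozen=True)
class ModeLayout:
    """How one execution mode divides a node among MPI tasks (illustrative only)."""
    name: str
    mpi_tasks_per_node: int

    @property
    def total_mpi_tasks(self):
        return NODES * self.mpi_tasks_per_node

    @property
    def memory_per_task_gb(self):
        return NODE_MEMORY_GB / self.mpi_tasks_per_node


# ASSUMPTION: the default mode runs a single MPI task per node,
# with the second processor dedicated to communication.
DEFAULT_MODE = ModeLayout("default", 1)
VIRTUAL_MODE = ModeLayout("virtual", 2)

for mode in (DEFAULT_MODE, VIRTUAL_MODE):
    print(mode.name, mode.total_mpi_tasks, "tasks,", mode.memory_per_task_gb, "GB each")
```

Virtual mode doubles the number of MPI tasks but halves the memory available to each one, which is the trade-off an application has to weigh.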

Architecture

• The BG/L has a double floating-point unit (DFPU)
  ➢ Built by merging 2 FPUs (primary and secondary)
  ➢ Secondary has its own set of instructions to support complex arithmetic
• Code generation for the DFPU is done by TOBEY
  ➢ TOBEY recognizes complex computations and uses the SIMD-like extensions of BG/L to implement them efficiently
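
As an illustration of why a paired FPU suits complex arithmetic, the sketch below models a register pair as a (primary, secondary) tuple and performs a complex multiply in two paired operations. The helper names are hypothetical and are not the actual BG/L instruction mnemonics; this is only a model of the idea, not the code TOBEY generates.

```python
def paired_multiply(scalar, pair):
    """One paired operation: multiply both halves of a register pair by a scalar."""
    primary, secondary = pair
    return (scalar * primary, scalar * secondary)


def cross_multiply_add(acc, scalar, pair):
    """One paired operation with crossed operands:
    primary   = acc.primary   - scalar * pair.secondary
    secondary = acc.secondary + scalar * pair.primary
    """
    primary, secondary = pair
    return (acc[0] - scalar * secondary, acc[1] + scalar * primary)


def complex_multiply(x, y):
    """Multiply x = (a, b) and y = (c, d), each read as real + imag * i."""
    a, b = x
    partial = paired_multiply(a, y)             # (a*c, a*d)
    return cross_multiply_add(partial, b, y)    # (a*c - b*d, a*d + b*c)


product = complex_multiply((1.0, 2.0), (3.0, 4.0))
print(product)
```

Each helper touches both halves of the pair at once, which is the SIMD-like behavior the slide refers to.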

5 Networks

• A 3D torus network
• Global tree network
• Global barrier and interrupt network
• Gigabit Ethernet to Joint Test Action Group (JTAG) network for machine control
• A second Gigabit Ethernet network for connection to other systems, such as hosts and file systems

Torus Network

• Handles the BG/L’s general-purpose point-to-point communication
• Connects the nodes so that each node has 6 adjacent neighbors
• Bandwidth of each link
  ➢ 2 bits/cycle, or
  ➢ 175 MB/s @ 700 MHz
• Each message is broken into packets
  ➢ Range: 32 bytes - 256 bytes
  ➢ 32-byte increments
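
The torus figures above can be worked through in a few lines. The sketch below shows the six wraparound neighbors of a node, the link-bandwidth conversion, and a simple split of a message into packet sizes. The 64 x 32 x 32 torus shape is an assumption chosen because it multiplies out to 65,536 nodes, and the packet split ignores any header overhead. All function names are illustrative.

```python
TORUS_DIMS = (64, 32, 32)   # ASSUMPTION: one shape that gives 65,536 nodes
MAX_PACKET_BYTES = 256      # from the slide
PACKET_STEP_BYTES = 32      # from the slide


def torus_neighbors(coord, dims=TORUS_DIMS):
    """Return the 6 neighbors of a node; edges wrap around in each dimension."""
    neighbors = []
    for axis in range(3):
        for step in (1, -1):
            neighbor = list(coord)
            neighbor[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(neighbor))
    return neighbors


def link_bandwidth_mb_per_s(bits_per_cycle=2, clock_hz=700e6):
    """Convert bits per cycle at a given clock rate into megabytes per second."""
    return bits_per_cycle * clock_hz / 8 / 1e6


def packet_sizes(message_bytes):
    """Split a message into packets of at most 256 bytes,
    rounding each packet up to a multiple of 32 bytes (headers ignored)."""
    sizes = []
    remaining = message_bytes
    while remaining > 0:
        chunk = min(remaining, MAX_PACKET_BYTES)
        padded = -(-chunk // PACKET_STEP_BYTES) * PACKET_STEP_BYTES  # ceiling to step
        sizes.append(padded)
        remaining -= chunk
    return sizes


print(torus_neighbors((0, 0, 0)))
print(link_bandwidth_mb_per_s())
print(packet_sizes(600))
```

Note that a corner node such as (0, 0, 0) still has six neighbors, because the torus wraps around instead of ending at an edge.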

Tree Network

• Used for collective communication patterns that often occur, such as broadcast or reduction
• A network that combines 2 or more star networks together
  ➢ Star network: a network in which all of the workstation nodes are linked to one central node
  ➢ Bandwidth of 350 MB/s
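
A reduction over a tree finishes in a number of steps that grows with the depth of the tree instead of the number of nodes, which is why a tree suits collective operations. The sketch below combines values pairwise, level by level. The binary, pairwise layout is an assumption made for illustration; the slides do not describe the actual tree shape or how the hardware combines values.

```python
from operator import add


def tree_reduce(values, op):
    """Combine values pairwise, one tree level at a time.

    Returns (result, levels), where levels is the number of combining steps.
    """
    level = list(values)
    if not level:
        raise ValueError("tree_reduce needs at least one value")
    levels = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(op(level[i], level[i + 1]))
            else:
                next_level.append(level[i])  # odd one out passes straight up
        level = next_level
        levels += 1
    return level[0], levels


total, steps = tree_reduce(range(1, 9), add)
print(total, steps)
```

With one value per node, a pairwise tree over 65,536 nodes needs only 16 combining levels.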

BG/L vs. Earth Simulator

Blue Gene/L
• 65,536 nodes
  ➢ Two processors
  ➢ 2 GB memory
• 5 networks
• 135.5 teraflops

Earth Simulator
• 640 nodes
  ➢ 8 vector processors
  ➢ 16 GB of memory
• SX-6 architecture
• 35.86 teraflops

OS

• BG/L uses Linux for its front-end nodes
• Its compute nodes don’t use Linux, but have a kernel that is inspired by it
• Because BG/L is based on Linux, testing was done on Linux clusters
  ➢ BGLsim: parallel application created to simulate BG/L
• Most supercomputers are moving towards Linux (not Win-doze!!)
  ➢ Cheaper
  ➢ Libraries
  ➢ Familiarity

Limitations

• Not a general-purpose machine
• Designed to solve grid-based problems that involve nodes communicating with their nearest neighbors
• Most problems BG/L will solve are found in high-energy physics, molecular dynamics, and astrophysics
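
A grid-based, nearest-neighbor problem looks roughly like the sketch below: each cell is updated from its four adjacent cells only, so a node holding part of the grid needs to talk only to the nodes holding adjacent parts. This is a generic averaging sweep on a small periodic 2D grid, chosen for illustration. It is not taken from the slides or from any BG/L application.

```python
def jacobi_step(grid):
    """Replace every cell with the average of its 4 nearest neighbors.

    The grid wraps around at the edges, like a torus.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            (
                grid[(r - 1) % rows][c]
                + grid[(r + 1) % rows][c]
                + grid[r][(c - 1) % cols]
                + grid[r][(c + 1) % cols]
            )
            / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# A single "hot" cell spreads only to its four direct neighbors in one step.
hot = [[0.0] * 4 for _ in range(4)]
hot[1][1] = 4.0
after_one_step = jacobi_step(hot)
for row in after_one_step:
    print(row)
```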

Conclusion

• BG/L implements a new philosophy for supercomputers

• It uses low-speed processors that each handle a relatively low workload

• The architecture of Blue Gene/L makes it the fastest supercomputer in the world

Are there any questions??
