Papers
“Overview of the BlueGene/L Supercomputer”
© 2002 IBM and Lawrence Livermore National Laboratory
“Unlocking the Performance of the BlueGene/L Supercomputer”
© 2004 IBM and Lawrence Livermore National Laboratory
Topics
• History
• Philosophy
• System Overview
• Architecture
• Networks
• Comparison
• OS
• Limitations
• Conclusion
Context
• Earth Simulator (March 11, 2002)
  ◦ 35.86 teraflops
  ◦ Fastest computer before Blue Gene/L
• Blue Gene/L (September 29, 2004)
  ◦ IBM and Lawrence Livermore National Laboratory
  ◦ First in a line of computers that would eventually pass the petaflop mark
  ◦ 135.5 teraflops (and they weren’t even finished yet)
Purpose of Blue Gene/L
• Primarily performs simulations in the life sciences
• Protein folding
• Will likely be used in the search for cures to diseases:
  ◦ Alzheimer’s
  ◦ Cystic fibrosis
  ◦ Mad cow disease
Philosophy
• Obvious goal:
  ◦ Create a supercomputer that runs as fast as possible
• Typical approach:
  ◦ Take a bunch of really fast nodes
  ◦ Group them together
  ◦ Give them all a lot of computation responsibility
Philosophy
• Limitations of the typical approach:
  ◦ The large, fast SMPs consume increasingly large amounts of electricity
  ◦ Adding more processors delivers additional processing power at a decreasing rate
Philosophy
• The Blue Gene/L approach:
  ◦ Completely different
  ◦ Use a “very large” number of nodes
    ▪ 65,536 to be exact
  ◦ Each node has a modest clock rate
    ▪ About 700 MHz
    ▪ Low power consumption
  ◦ Nodes are given very specific tasks
Philosophy
• Other design features
  ◦ IBM PowerPC embedded CMOS processors
  ◦ Embedded DRAM
  ◦ System-on-chip techniques
  ◦ Dual-processor design (more on that below)
• Dual processor
  ◦ Compute node
    ▪ Handles computation
  ◦ I/O node
    ▪ Handles communication
Philosophy
• Why dual-processor?
  ◦ The I/O nodes provide the physical interface to the file system and handle various other processes that would be burdensome for the compute nodes
  ◦ Allows the compute node software to be kept simple
  ◦ In keeping with the philosophy…
Peak Performance
• The Blue Gene team estimates that the BG/L’s peak performance will be about 360 teraflops
• Applications that do not take advantage of the second processor should expect a peak performance of 180 teraflops
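The 360 and 180 teraflop estimates are roughly what comes out of multiplying node count, processors per node, clock rate, and floating-point operations per cycle. The sketch below takes the first three from these slides and assumes 4 floating-point operations per cycle per processor (two fused multiply-adds on the double FPU), a figure the slides do not state.

```python
# Back-of-the-envelope peak estimate for Blue Gene/L.
NODES = 65_536            # from the slides
PROCESSORS_PER_NODE = 2   # from the slides
CLOCK_HZ = 700e6          # from the slides ("about 700 MHz")
FLOPS_PER_CYCLE = 4       # ASSUMPTION: 2 fused multiply-adds per cycle per processor


def peak_teraflops(processors_used: int) -> float:
    """Theoretical peak in teraflops when each node uses `processors_used` processors."""
    return NODES * processors_used * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12


both = peak_teraflops(PROCESSORS_PER_NODE)  # both processors computing
one = peak_teraflops(1)                     # second processor not used for computation

print(f"both processors: {both:.1f} TF")
print(f"one processor:   {one:.1f} TF")
```

Under that assumption the result lands slightly above the quoted "about 360" and "180" figures, and the single-processor number is exactly half of the dual-processor one.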
System Overview
• Each node
  ◦ Single Application-Specific Integrated Circuit (ASIC)
  ◦ 2 GB local memory
• 2 nodes / compute card
• 16 compute cards / node board
• 16 node boards / midplane
• 2 midplanes / 1024-node rack
• 64 racks
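The packaging hierarchy above can be checked by multiplying the levels together. This is a minimal sketch using only the counts from the slide; the names `PACKAGING` and `cumulative_nodes` are illustrative.

```python
# Packaging levels from the slide, smallest unit first.
PACKAGING = [
    ("compute card", 2),   # 2 nodes per compute card
    ("node board", 16),    # 16 compute cards per node board
    ("midplane", 16),      # 16 node boards per midplane
    ("rack", 2),           # 2 midplanes per rack
    ("system", 64),        # 64 racks in the full system
]


def cumulative_nodes(levels):
    """Return (level name, nodes contained in one unit of that level) pairs."""
    total = 1
    result = []
    for name, factor in levels:
        total *= factor
        result.append((name, total))
    return result


for name, nodes in cumulative_nodes(PACKAGING):
    print(f"{name:>12}: {nodes:>6} nodes")
```

The rack level comes out to 1,024 nodes and the full system to 65,536, matching the "1024-node rack" and the node count quoted earlier.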
System Overview
• “Link” ASIC
  ◦ Between the midplanes
  ◦ Serves two purposes
    ▪ 1) Re-drives (and therefore strengthens) the signal between midplanes
    ▪ 2) Allows the signals to be redirected between different ports
Architecture
• Each node has 2 PowerPC 440 processors at 700 MHz
• 2 different execution modes in which the processors interact
  ◦ Communication coprocessor mode (default)
    ▪ One processor -> communication
    ▪ One processor -> general processing
  ◦ Virtual node mode
Virtual Node Mode
• Processors act independently
• Each processor gets half of the memory and a separate MPI task
  ◦ Tasks share use of the network and memory
  ◦ A special region of non-cached shared memory allows communication within the same node
Architecture
• The BG/L has a Double Floating-Point Unit (DFPU)
  ◦ Built by merging 2 FPUs (primary and secondary)
  ◦ Secondary has its own set of instructions to support complex arithmetic
• Code generation for the DFPU is done by TOBEY
  ◦ TOBEY recognizes complex computations and uses the SIMD-like extensions of BG/L to implement them efficiently
5 Networks
• A 3D torus network
• Global tree network
• Global barrier and interrupt network
• Gigabit Ethernet to Joint Test Action Group (JTAG) network for machine control
• A second Gigabit Ethernet network for connection to other systems, such as hosts and file systems
Torus Network
• Carries the BG/L’s general-purpose communication
• Connects the nodes so that each node has 6 adjacent neighbors
• Bandwidth for these links
  ◦ 2 bits/cycle, or
  ◦ 175 MB/s @ 700 MHz
• Each message is broken into packets
  ◦ Range: 32 bytes - 256 bytes
  ◦ 32-byte increments
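The torus figures above can be illustrated with a short sketch: the link bandwidth arithmetic, the six wraparound neighbors of a node, and the splitting of a message into packets. The 64 x 32 x 32 torus shape is an assumption (the slides give only the total of 65,536 nodes), and the packet splitter ignores header overhead and simply pads the last packet up to the next 32-byte step; all function names are illustrative.

```python
CLOCK_HZ = 700e6           # from the slides
BITS_PER_CYCLE = 2         # from the slides
PACKET_STEP = 32           # packets grow in 32-byte increments
MAX_PACKET = 256           # largest packet size in bytes
TORUS_DIMS = (64, 32, 32)  # ASSUMPTION: one shape whose product is 65,536


def link_bandwidth_mb_s() -> float:
    """Per-link bandwidth in MB/s: bits per cycle * cycles per second / 8 bits."""
    return BITS_PER_CYCLE * CLOCK_HZ / 8 / 1e6


def torus_neighbors(coord, dims=TORUS_DIMS):
    """Return the 6 neighbors of `coord`, wrapping around at the torus edges."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def packet_sizes(message_bytes: int):
    """Split a message into packet sizes of 32..256 bytes in 32-byte steps."""
    sizes = []
    remaining = message_bytes
    while remaining > 0:
        chunk = min(remaining, MAX_PACKET)
        # Round the final chunk up to the next multiple of 32 bytes (padding).
        sizes.append(-(-chunk // PACKET_STEP) * PACKET_STEP)
        remaining -= chunk
    return sizes


print(link_bandwidth_mb_s())       # 2 bits/cycle at 700 MHz
print(torus_neighbors((0, 0, 0)))  # corner node still has 6 neighbors
print(packet_sizes(600))           # two full packets plus a padded remainder
```

The wraparound in `torus_neighbors` is what distinguishes a torus from a plain mesh: a node on the "edge" of the grid has the same number of neighbors as any other node.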
Tree Network
• Used for collective communication patterns that often occur, such as broadcasting or reduction
• A network that combines 2 or more star networks together
  ◦ Star network: network where all of the workstation nodes are linked to one central node
  ◦ Bandwidth of 350 MB/s
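To make the collective patterns concrete, here is a minimal sketch of a reduction and a broadcast over a tree. It assumes an implicit binary tree in which node `i` has children `2i+1` and `2i+2`; the actual BG/L tree topology is not described in these slides, so this layout is only for illustration.

```python
def tree_reduce(values, op):
    """Combine one value per node up the tree; the result ends up at the root (node 0).

    `values` must be non-empty. Nodes are visited from the highest index down,
    so every node has already absorbed its children before it is folded into its parent.
    """
    partial = list(values)
    for i in range(len(partial) - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] = op(partial[parent], partial[i])
    return partial[0]


def tree_broadcast(value, n_nodes):
    """Send `value` from the root down the tree; each node copies it from its parent."""
    received = [None] * n_nodes
    received[0] = value
    for i in range(1, n_nodes):
        received[i] = received[(i - 1) // 2]
    return received


total = tree_reduce([1, 2, 3, 4, 5, 6, 7], lambda a, b: a + b)
print(total)
print(tree_broadcast(total, 7))
```

A reduction followed by a broadcast of the result gives every node the combined value, which is the pattern behind an all-reduce style operation.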
BG/L vs. Earth Simulator
Blue Gene/L
• 65,536 nodes
  ◦ Two processors
  ◦ 2 GB memory
• 5 networks
• 135.5 teraflops
Earth Simulator
• 640 nodes
  ◦ 8 vector processors
  ◦ 16 GB of memory
• SX-6 architecture
• 35.86 teraflops
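Dividing the quoted throughput by the node count shows the design difference directly: BG/L uses many modest nodes and Earth Simulator uses few powerful ones. The sketch below uses only the figures from this comparison; the dictionary layout and function name are illustrative.

```python
# Figures as quoted in the comparison above.
SYSTEMS = {
    "Blue Gene/L": {"nodes": 65_536, "teraflops": 135.5},
    "Earth Simulator": {"nodes": 640, "teraflops": 35.86},
}


def gigaflops_per_node(name: str) -> float:
    """Quoted system throughput divided evenly across its nodes, in gigaflops."""
    system = SYSTEMS[name]
    return system["teraflops"] * 1000 / system["nodes"]


for name in SYSTEMS:
    print(f"{name}: {gigaflops_per_node(name):.2f} GF per node")
```

On these numbers each Earth Simulator node delivers well over an order of magnitude more throughput than a BG/L node, while BG/L has roughly a hundred times as many nodes.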
OS
• BG/L uses Linux for its front-end nodes
• Its compute nodes don’t use Linux, but have a kernel that is inspired by it
• Because BG/L is based on Linux, testing was done on Linux clusters
  ◦ BGLsim: parallel application created to simulate BG/L
• Most supercomputers are moving towards Linux (not Win-doze!!)
  ◦ Cheaper
  ◦ Libraries
  ◦ Familiarity
Limitations
• Not a general-purpose machine
• Designed to solve grid-based problems that involve nodes communicating with their nearest neighbors
• Most problems BG/L will solve are found in high-energy physics, molecular dynamics and astrophysics
Conclusion
• BG/L implements a new philosophy for supercomputers
• It uses low-speed processors that each handle a relatively low workload
• The architecture of Blue Gene/L makes it the fastest supercomputer in the world