
© 2006 Regents of the University of California. All Rights Reserved

RAMP Blue: A Message Passing Multi-Processor System on the BEE2

Andrew Schultz and Alex Krasnov

P.I. John Wawrzynek


Introduction

• RAMP Blue is an initial design driver for the RAMP project, with the goal of building an extremely large (~1000-node) multiprocessor system on a cluster of BEE2 modules using soft-core processors
– Goal of RAMP Blue is to experiment and learn lessons on building large-scale computer architectures on FPGA platforms for emulation, not performance
• Current system has 256 cores on an 8-module cluster running full parallel benchmarks; easy upgrade to 512 cores on 16 modules once boards are available


RDF: RAMP Design Framework

• All designs to be implemented within RAMP are known as the target system and must follow restrictions defined by the RDF
– All designs are composed of units with well-defined ports that communicate with each other over uni-directional, point-to-point channels
– Units can be as simple as a single logic gate, or more often a larger unit such as a CPU core or cache
– Timing between units is completely decoupled by the channel
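To make the channel abstraction concrete, here is a minimal software sketch (illustrative only; the names and depth are not from the RAMP sources) of a uni-directional, point-to-point channel modeled as a bounded FIFO, where sender and receiver interact only through non-blocking try-send/try-receive calls:

    #include <stdbool.h>
    #include <stdint.h>

    #define CHAN_DEPTH 16u  /* power of two so the indices wrap cheaply */

    /* A uni-directional, point-to-point channel: one writer, one reader. */
    typedef struct {
        uint32_t buf[CHAN_DEPTH];
        unsigned head;   /* next slot the receiver will read */
        unsigned tail;   /* next slot the sender will write  */
    } channel_t;

    /* Sender side: returns false when the channel is full (back-pressure). */
    static bool chan_try_send(channel_t *c, uint32_t word) {
        if (c->tail - c->head == CHAN_DEPTH)
            return false;                       /* full: sender must stall */
        c->buf[c->tail % CHAN_DEPTH] = word;
        c->tail++;
        return true;
    }

    /* Receiver side: returns false when no data is available yet. */
    static bool chan_try_recv(channel_t *c, uint32_t *word) {
        if (c->head == c->tail)
            return false;                       /* empty: receiver stalls */
        *word = c->buf[c->head % CHAN_DEPTH];
        c->head++;
        return true;
    }

Because neither side ever waits on the other directly, a unit built on such a channel makes no assumption about the clock rate of its peer, which is exactly the decoupling the RDF requires.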


RAMP Blue Goals

• RAMP Blue is a sibling to the other RAMP design driver projects:
1) RAMP Red: port of existing transactional cache system to FPGA PowerPC cores
2) RAMP Blue: message passing multiprocessor system with existing, FPGA-optimized soft core (MicroBlaze)
3) RAMP White: cache coherent multiprocessor system with full-featured soft core
• Blue is also to run off-the-shelf, message passing, scientific codes and benchmarks (providing existing tests and a basis for comparison)
• Main goal is to fit as many cores as possible in the system and have the ability to reliably run code and change system parameters


RAMP Blue Requirements

• Built from existing tools (RDL not available at the time) but fits RDF guidelines for future integration
• Requires design and implementation of gateware and software for running MicroBlaze with uClinux on BEE2 modules
– Sharing of the DDR2 memory system
– Communication and bootstrapping from the user FPGAs
– Debugging and control from the control FPGA
• New on-chip network for MicroBlaze-to-MicroBlaze communication
– Communication on chip, FPGA to FPGA on a board, and board to board
• Completely new double precision floating point unit for scientific codes


MicroBlaze Characteristics

• 3-stage, RISC-like architecture designed for implementation on FPGAs
– Takes advantage of FPGA-unique features (e.g. fast carry chains) and addresses FPGA shortcomings (e.g. lack of CAMs in cache)
• Maximum clock rate of 100 MHz (~0.5 MIPS/MHz) on Virtex-II Pro FPGAs
• Split I and D cache with configurable size, direct mapped
• Fast hardware multiplier/divider, optional hardware barrel shifter
• Configurable hardware debugging support (watch/breakpoints)
• Several peripheral interface bus options
• GCC tool chain support and ability to run uClinux


Node Architecture


Memory System

• Requires sharing the memory channel among a configurable number of MicroBlaze cores
– No coherence; each DIMM is partitioned, and bank management keeps cores from contending with each other
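A minimal sketch of the partitioning idea (illustrative; the sizes and address math are assumptions, not the RAMP Blue gateware): each core is given a fixed, disjoint slice of the DIMM, so no coherence protocol is needed:

    #include <stdint.h>

    /* Assumed parameters: one 1 GB DIMM split evenly among 8 cores. */
    #define DIMM_BYTES     (1ULL << 30)
    #define CORES_PER_DIMM 8
    #define SLICE_BYTES    (DIMM_BYTES / CORES_PER_DIMM)

    /* Translate a core-local address into a physical DIMM address.
     * Slices are disjoint, so cores can never touch each other's data. */
    static uint64_t dimm_addr(unsigned core_id, uint64_t local_addr) {
        return (uint64_t)core_id * SLICE_BYTES + (local_addr % SLICE_BYTES);
    }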


Control Communication

• Communication channel from control PowerPC to individual MicroBlaze required for bootstrapping and debugging
– Gateware provides general purpose, low-speed network
– Software provides character and Ethernet abstractions on the channel
– Kernel is sent over the channel and file systems can be mounted
– Console channel allows debugging messages and control
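As an illustration of the character abstraction (the register layout and names below are invented for the sketch, not taken from the BEE2 gateware), a console driver on the MicroBlaze side could poll a memory-mapped FIFO on the control channel like this:

    #include <stdint.h>

    /* Hypothetical memory-mapped control-channel registers. */
    #define CTRL_BASE   0xF0000000u
    #define CTRL_STATUS (*(volatile uint32_t *)(CTRL_BASE + 0x0))
    #define CTRL_RXDATA (*(volatile uint32_t *)(CTRL_BASE + 0x4))
    #define CTRL_TXDATA (*(volatile uint32_t *)(CTRL_BASE + 0x8))
    #define RX_VALID    0x1u  /* a character is waiting */
    #define TX_READY    0x2u  /* transmit FIFO has room */

    /* Blocking single-character read/write over the control channel. */
    static char ctrl_getc(void) {
        while (!(CTRL_STATUS & RX_VALID))
            ;                              /* spin until data arrives */
        return (char)CTRL_RXDATA;
    }

    static void ctrl_putc(char c) {
        while (!(CTRL_STATUS & TX_READY))
            ;                              /* spin until FIFO drains */
        CTRL_TXDATA = (uint32_t)c;
    }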


Double Precision FPU

• Due to the size of the FPU, sharing is crucial to meeting the resource budget
• Shared FPU works much like reservation stations in a microarchitecture, with the MicroBlaze cores issuing instructions to it
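The reservation-station analogy can be sketched as follows (an illustrative software model under assumed parameters, not the actual gateware): each core deposits its operands in a private slot, and a round-robin arbiter feeds one pending request at a time to the single shared FPU:

    #include <stdbool.h>

    #define NUM_CORES 8  /* assumed: cores sharing one FPU */

    /* One reservation slot per core: operands wait here until the
     * shared FPU picks the request up. */
    typedef struct {
        bool   busy;    /* request pending           */
        double a, b;    /* operands                  */
        double result;  /* written back when done    */
        bool   done;    /* result ready for the core */
    } fpu_slot_t;

    static fpu_slot_t slots[NUM_CORES];

    /* Round-robin arbiter: service one pending request per "cycle". */
    static void fpu_step(void) {
        static unsigned next = 0;
        for (unsigned i = 0; i < NUM_CORES; i++) {
            fpu_slot_t *s = &slots[(next + i) % NUM_CORES];
            if (s->busy) {
                s->result = s->a * s->b;  /* e.g. a double multiply */
                s->busy = false;
                s->done = true;
                next = (next + i + 1) % NUM_CORES;
                return;
            }
        }
    }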


Network Characteristics

• Interconnect must fit within the RDF model
– Network interface uses simple FSL channels, currently PIO but could be DMA
• Source routing between nodes (non-adaptive, intolerant of link failure)
– The only links that could physically fail are the board-to-board XAUI links
• Topology of the interconnect is a full crossbar on chip with all-to-all connection of board-to-board links
– Longest path between nodes is four on-board links and one off-board link
• Encapsulated Ethernet packets with source routing information prepended
• Virtual cut-through flow control with virtual channels for deadlock avoidance
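The encapsulation step can be sketched in a few lines (header format invented for illustration; the real gateware's header layout may differ): the source route is simply a list of output ports, consumed hop by hop, prepended to the Ethernet frame:

    #include <stdint.h>
    #include <string.h>

    #define MAX_HOPS 5  /* four on-board links plus one off-board link */

    /* Hypothetical source-route header: one output port per hop,
     * consumed left to right as the packet traverses each switch. */
    typedef struct {
        uint8_t num_hops;
        uint8_t hops[MAX_HOPS];
    } route_hdr_t;

    /* Prepend the route to an encapsulated Ethernet frame; returns
     * the total length of the resulting network packet. */
    static size_t encap(uint8_t *pkt, const route_hdr_t *rt,
                        const uint8_t *eth_frame, size_t eth_len) {
        memcpy(pkt, rt, sizeof *rt);                  /* route first  */
        memcpy(pkt + sizeof *rt, eth_frame, eth_len); /* then payload */
        return sizeof *rt + eth_len;
    }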


Network Implementation


Multiple Cores

• Scaling up to multiple cores per FPGA is primarily constrained by resources
– The current evaluation cluster implements 8 cores/FPGA using roughly 85% of the slices (but only slightly more than half of the LUTs/FFs)
• Sixteen cores fit on each FPGA without infrastructure (switch, memory, etc.); 10-12 is the maximum depending on options
– Options include hardware accelerators, cache size, FPU timing, etc.


Test Cluster

• Sixteen BEE2 modules, with 8 cores and two 1 GB DDR2 DIMMs per user FPGA
– Overall cluster has 512 cores; by scaling up to 12 cores per FPGA and utilizing the control FPGA, it is realistic to reach 960 cores in the cluster
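For the arithmetic (each BEE2 module carries four user FPGAs plus one control FPGA): 16 modules × 4 user FPGAs × 8 cores = 512 cores; at 12 cores per FPGA across all five FPGAs per module, 16 × 5 × 12 = 960 cores.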


Software

• Each node in the cluster boots its own copy of uClinux and mounts its file system from an external NFS server
• The Unified Parallel C (UPC) message passing framework was ported to uClinux
– The main porting effort with UPC is adapting to its transport layer, GASNet, but this was circumvented by using the existing GASNet UDP transport
• Floating point integration is achieved by modifying the GCC soft-float backend to emit code that interacts with the shared FPU (see the sketch after this list)
• Within UPC, the NAS Parallel Benchmarks run on the cluster
– Only class "S" benchmarks can be run due to the limited memory (256 MB/node)
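As a hedged illustration of the soft-float integration (the FPU interface functions below are invented for the sketch; only the __adddf3 entry point is the standard libgcc soft-float routine name), the calls GCC emits for double-precision arithmetic can be redirected to helpers that forward the operation to the shared FPU instead of emulating it in integer code:

    #include <stdint.h>

    /* Hypothetical PIO interface to the shared FPU (e.g. over FSL). */
    extern void   fpu_issue(uint32_t op, double a, double b);
    extern double fpu_wait_result(void);

    #define FPU_OP_ADD 0u

    /* Replacement for the routine GCC emits calls to for double-
     * precision addition when compiling with software floating point. */
    double __adddf3(double a, double b) {
        fpu_issue(FPU_OP_ADD, a, b);  /* park operands in a reservation slot */
        return fpu_wait_result();     /* block until the shared FPU answers  */
    }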


Performance

• Performance is not the key metric for success with RAMP Blue
– While improving performance is a secondary goal, the primary goal is the ability to reliably implement a system with a wide range of parameters and meet timing closure within the desired resource budget
• Analysis of performance points out bottlenecks for incremental improvement in future RAMP infrastructure designs
– Analysis of the node-to-node network shows software (i.e. the network interface) is the primary bottleneck; finer-grained analysis forthcoming with the RDL port
• Just for the heck of it, the NAS performance numbers:


Implementation Issues

• Building such a large design exposed several insidious bugs in both hardware and gateware
– MicroBlaze bugs in both the gateware and the GCC toolchain required a good deal of time to track down (race conditions, OS bugs, GCC backend bugs)
– Memory errors with large designs on the BEE2 are still not completely understood; they probably stem from noise on the power plane increasing clock jitter
• Lack of debugging insight and long recompile times greatly hindered progress
• Building a large cluster exposed bugs caused by variation among BEE2 boards
• Multiple layers of user control (FPGA, processor, I/O, software) all contribute to uncertainty in operation


Conclusion

• RAMP Blue represents the first steps toward developing a robust library of RAMP infrastructure for building more complicated parallel systems
– Much of the RAMP Blue gateware is directly applicable to future systems
– Many important lessons were learned about required debugging/insight capabilities
– New bugs and reliability issues were exposed in the BEE2 platform and gateware, helping to shape future RAMP hardware platforms and the characteristics needed for robust software/gateware infrastructure
• RAMP Blue also represents the largest soft-core, FPGA-based computing system ever built and demonstrates the incredible research flexibility such systems allow
– The ability to literally tweak hardware interfaces and parameters provides a "research sandbox" for exciting new possibilities
– E.g. add DMA and RDMA to the networking, arbitrarily tweak the network topology, and experiment with system-level paging and coherence, etc.