
The NUMAchine Multiprocessor

ICPP 2000
Westin Harbour Castle, August 24, 2000

Presentation Overview

Architecture
- System Overview
- Key Features
- Fast ring routing

Hardware Cache Coherence

Memory Model: Sequential Consistency

Simulation Studies
- Ring performance
- Network Cache performance
- Coherence overhead

Prototype Performance

Hardware Status

Conclusion

System Architecture

Hierarchical ring network, based on clusters (NUMAchine’s ‘Stations’) which are themselves bus-based SMPs.
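To make the hierarchy concrete, here is a toy sketch in C of how a flat processor ID decomposes into a position in such a two-level ring hierarchy. The counts used (4 processors per Station, 4 Stations per local ring, 4 local rings) are illustrative assumptions, not the exact prototype configuration.

```c
#include <stdio.h>

/* Illustrative sizes only; the real configuration may differ. */
#define CPUS_PER_STATION  4
#define STATIONS_PER_RING 4
#define NUM_RINGS         4

/* Decompose a flat processor ID into (local ring, Station, CPU). */
static void locate(int cpu_id, int *ring, int *station, int *cpu)
{
    *cpu     = cpu_id % CPUS_PER_STATION;
    *station = (cpu_id / CPUS_PER_STATION) % STATIONS_PER_RING;
    *ring    = cpu_id / (CPUS_PER_STATION * STATIONS_PER_RING);
}

int main(void)
{
    int r, s, c;
    locate(42, &r, &s, &c);
    /* CPU 42 -> ring 2, station 2, cpu 2 */
    printf("cpu 42 -> ring %d, station %d, cpu %d\n", r, s, c);
    return 0;
}
```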

NUMAchine’s Key Features

Hierarchical rings
- Allow for very fast and simple routing
- Provide good support for broadcast and multicast

Hardware Cache Coherence
- Hierarchical, directory-based, CC-NUMA system
- Writeback/invalidate protocol, designed to use the broadcast/ordering properties of rings

Sequentially Consistent Memory Model
- The most intuitive model for programmers trained on uniprocessors

Simple, low cost, but with good flexibility, scalability and performance.

Fast Ring Routing: Filtermasks

Fast ring routing is achieved by the use of Filtermasks (i.e., simple bit-masks) to store cache-line location information; the imprecision reduces directory storage requirements.

These Filtermasks are used directly by the routing hardware in the ring interfaces.
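As a rough illustration, the C sketch below shows how such a bit-mask might be built up and tested by a ring interface. The two-level encoding (one bit per local ring plus one bit per Station position) and the field widths are assumptions for illustration; the OR-based update is what makes the mask an imprecise superset of the true sharers.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_RINGS    4   /* local rings on the central ring (assumed) */
#define NUM_STATIONS 4   /* stations per local ring (assumed)         */

typedef struct {
    uint8_t ring_bits;     /* which local rings may hold a copy        */
    uint8_t station_bits;  /* which Station positions may hold a copy  */
} filtermask_t;

/* Record that the Station at position `s` on ring `r` (may) hold the line. */
static void fmask_add(filtermask_t *m, int r, int s)
{
    m->ring_bits    |= (uint8_t)(1u << r);
    m->station_bits |= (uint8_t)(1u << s);
}

/* Routing test: deliver a packet carrying mask `m` to Station (r, s)?
 * One AND per field keeps the ring-interface logic trivial, at the cost
 * of false positives: once sharers on different rings are added, every
 * (ring, station) combination of the set bits matches. */
static int fmask_match(const filtermask_t *m, int r, int s)
{
    return (m->ring_bits & (1u << r)) && (m->station_bits & (1u << s));
}

int main(void)
{
    filtermask_t m = {0, 0};
    fmask_add(&m, 0, 1);  /* sharer: ring 0, station 1 */
    fmask_add(&m, 2, 3);  /* sharer: ring 2, station 3 */

    printf("deliver to (0,1): %d\n", fmask_match(&m, 0, 1)); /* 1 */
    printf("deliver to (0,3): %d\n", fmask_match(&m, 0, 3)); /* 1: false positive */
    printf("deliver to (1,1): %d\n", fmask_match(&m, 1, 1)); /* 0 */
    return 0;
}
```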

Hardware Cache Coherence

Hierarchical, directory-based, writeback/invalidate.

Directory entries are stored in the per-Station memory (the ‘home’ location) and cached in the network interfaces (hence the name, Network Cache).

The Network Cache stores both the remotely cached directory information and the cache lines themselves, allowing the network interface to perform coherence operations locally (on-Station) and avoid remote accesses to the home directory.

Filtermasks indicate which Stations (i.e., clusters) may potentially have a copy of a cache line (the fuzziness is due to the imprecise nature of the Filtermasks).

Processor Masks are used only within a Station, to indicate which particular caches may contain a copy (the fuzziness here is due to Shared lines that may have been silently ejected).
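As a minimal sketch of the two directory levels just described, the C structures below show one plausible layout. The state names, field widths, and line size are assumptions not given in the source, not the actual NUMAchine encoding.

```c
#include <stdint.h>

/* Assumed coherence states; the slide does not spell out the state set. */
enum line_state { INVALID, SHARED, DIRTY };

/* Home directory entry, kept in the per-Station memory: an imprecise
 * Filtermask of the Stations that may hold a copy. */
typedef struct {
    uint8_t         filtermask;  /* superset of the sharer Stations */
    enum line_state state;
} home_dir_entry_t;

/* Network Cache entry: cached directory information plus the line
 * itself, so coherence actions can complete on-Station. */
typedef struct {
    uint8_t         proc_mask;   /* which local caches may hold a copy;
                                    imprecise, since Shared lines may be
                                    silently ejected */
    enum line_state state;
    uint8_t         data[64];    /* the cache line; size is an assumption */
} nc_entry_t;
```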

Memory Model: Sequential Consistency

The most intuitive model for the normally trained programmer: increases the usability of the system.

Easily supported by NUMAchine’s ring network: the only change necessary is to force invalidates to pass through a global ‘sequencing point’ on the ring, increasing the average invalidation latency by 2 ring hops (40 ns with our default 50 MHz rings).
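For concreteness, the 40 ns figure follows directly from the ring clock, assuming one ring hop per clock cycle (an assumption, but consistent with the numbers given): one 50 MHz cycle is 1 / (50 × 10^6 Hz) = 20 ns, so 2 extra hops × 20 ns/hop = 40 ns.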

Simulation Studies: Ring Performance 1

We use the SPLASH-2 benchmark suite and a cycle-accurate hardware simulator with full modeling of the coherence protocol.

Applications with high communication-to-computation ratios (e.g. FFT, Radix) show high utilizations, particularly in the Central Ring, indicating that a faster Central Ring would help.

Simulation Studies: Ring Performance 2

Maximum and average ring-interface queue depths indicate network congestion, which correlates with bursty traffic.

Large differences between the maximum and average values indicate large variability in burst size.

Simulation Studies: Network Cache

The graphs measure the Network Cache’s effect by looking at the hit rate (i.e., the reduction in remote data and coherence traffic).

By categorizing the hits by coherence directory state, we also see where the benefits come from: caching shared data, or reducing invalidations and coherence traffic.

Simulation Studies: Coherence Overhead

We measure the overhead due to cache coherence by allowing all writes to succeed immediately, without checking cache-line state, and comparing against runs with the full cache coherence protocol in place (both using infinite-capacity Network Caches to avoid measurement noise due to capacity effects).

The results indicate that in many cases it is basic data locality and/or poor parallelizability that impedes performance, not cache coherence.

Prototype Performance

Speedups from the hardware prototype, compared against estimates from the simulator.

Hardware Prototype Status

Fully operational running the custom Tornado OS on a 32-processor system.

Conclusion

4- and 8-way SMPs are fast becoming commodity items.

The NUMAchine project has shown that a simple, cost-effective CC-NUMA multiprocessor can be built using these SMP building blocks and a simple ring network, and still achieve good performance and scalability.

In the medium-scale range (a few tens to hundreds of processors), rings are a good choice for a multiprocessor interconnect.

We have demonstrated an efficient hardware cache coherence scheme, designed to make use of the natural ordering and broadcast capabilities of rings.

NUMAchine’s architecture efficiently supports a sequentially consistent memory model, which we feel is essential for increasing the ease of use and programmability of multiprocessors.

Acknowledgments: The NUMAchine Team

Hardware
- Prof. Zvonko Vranesic
- Prof. Stephen Brown
- Robin Grindley (SOMA Networks)
- Alex Grbic
- Prof. Zeljko Zilic (McGill)
- Steve Caranci (Altera)
- Derek DeVries (OANDA)
- Guy Lemieux
- Kelvin Loveless (GNNettest)
- Prof. Sinisa Srbljic (Zagreb)
- Paul McHardy
- Mitch Gusat (IBM)

Operating Systems
- Prof. Michael Stumm
- Orran Krieger (IBM)
- Ben Gamsa
- Jonathon Appavoo
- Robert Ho

Compilers
- Prof. Tarek Abdelrahman
- Prof. Naraig Manjikian (Queens)

Applications
- Prof. Ken Sevcik