TMA: A Trap-Based Memory Architecture

Håkan Zeffer, Zoran Radovic, Martin Karlsson, Erik Hagersten

Uppsala University, Sweden

ICS’06, June 30th, 2006


Simultaneous Multithreading (SMT)

Diminishing performance from ILP
Increased chip parallelism from hardware threading (TLP)
  IBM Power5, Intel Pentium4, Sun T1 (Niagara)
“No processor should come without multiple threads” [Dr. Tremblay]

[Figure: SMT processor pipeline: fetch unit; decode, rename etc.; integer, floating-point, memory and branch pipes; L1I and L1D caches]


Chip Multiprocessors (CMPs)

[Figure: CMP block diagram: four processors (P), each with I and D caches, and four L2 blocks, joined by an interconnect]

Piranha, IBM Power4, IBM Power5, Sun UltraSPARC IV+, Sun T1, Intel Duo, AMD Dual-Core Opteron


Multi-CMP Systems

Larger systems sometimes built from multiple CMPs
  Piranha, IBM Power4 and IBM Power5

[Figure: CMP 1 to CMP 4 joined by an interconnect; each CMP contains four processors (P) with I and D caches and four L2 blocks on an on-chip interconnect]


Multi-CMP Coherence

Intra-CMP protocol for coherence within CMP
Inter-CMP protocol for coherence between CMPs
Interactions between protocols increase complexity

[Figure: CMP 1 to CMP 4 joined by an interconnect; intra-CMP coherence inside each CMP, inter-CMP coherence between CMPs]


Shared-Memory Trends

Today’s chips = yesterday’s mid-range servers
  Sun T1 has 32 hardware threads on a single die

Is it worth implementing multi-CMP systems?
  Increased development cost
  Increased verification cost
  How big is the market?


Trap-Based Memory Architectures

TMA: Trap-based Memory Architecture

Basic idea
  Optimize for commercial single-chip performance
  Let simple HW and SW support enable scalability

Coherence violation detection in hardware
Trap on inter-chip coherence violations
Solve inter-chip coherence misses in software


Outline

Introduction
TMA and TMA Lite
Evaluation methodology
Results
Related work
Future work
Conclusions


TMA Lite

TMA Lite is a “minimal” TMA implementation
  Per application “scalability”
  Binary transparency
  No memory system modifications
  Simple processor core modifications
    • An inter-node load coherence check
    • An inter-node store coherence check
  Runtime system
    • Deadlock avoidance
    • Coherence protocol


A TMA Lite System

TMA Lite nodes
  Single-chip system
    • Load and store coherence check support
  HW maintains intra-chip coherence

TMA Lite cluster network
  “InfiniBand like”
  High-bandwidth
  Low-latency
  Remote memory access (put, get and atomic)

TMA Lite software
  Coherence and consistency between nodes


The Load Check

Magic value convention
  Each cache line in state invalid contains a predefined value

Hardware
  Comparator at the load path detects this value
  Trap generated when the value is found

[Figure: load data is compared (=?) with the magic value register; the result is ANDed (&) with "load check enabled?" to produce "load trap?". Controlled by system software.]

False misses
  When the magic value is used within an application
  Easy to detect and solve within the coherence protocol
  Rare
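The load check described above can be sketched in software. This is only an illustrative model, not the authors' hardware or protocol code: the magic value `MAGIC`, the `LoadChecker` class, the `handle_load` helper and the `line_is_valid` callback are all hypothetical names chosen for this sketch, and the slides do not say which magic value is used.

```python
from dataclasses import dataclass

# Hypothetical magic value; the slides do not specify one.
MAGIC = 0xDEADBEEFDEADBEEF


class LoadTrap(Exception):
    """Raised where the hardware would trap to the software protocol."""


@dataclass
class LoadChecker:
    magic_value: int = MAGIC     # models the magic value register
    check_enabled: bool = True   # models the "load check enabled?" bit

    def load(self, memory: dict, addr: int) -> int:
        data = memory[addr]
        # Comparator on the load path: trap = (data == magic) AND enabled.
        if self.check_enabled and data == self.magic_value:
            raise LoadTrap(addr)
        return data


def handle_load(checker, memory, addr, line_is_valid):
    """Hypothetical handler that separates real misses from false misses."""
    try:
        return checker.load(memory, addr), "hit"
    except LoadTrap:
        if line_is_valid(addr):
            # False miss: the application itself stored the magic value.
            return memory[addr], "false miss"
        # Real inter-node miss: the software protocol would fetch the line.
        return None, "coherence miss"
```

In this model a false miss costs one trap plus a validity lookup, which matches the slide's point that false misses are easy to resolve inside the protocol.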


The Store Check

Write permission cache (WPC)
  Can be seen as a very small cache
  Operates on virtual addresses
  Accessed in parallel with the data TLB
  Write permission for lines in the WPC guaranteed by protocol

[Figure: store pipeline. Stages: address generation; TLB access, WPC access, start L1 access; tag compare; TLB trap?, WPC trap?, end L1 access. The data TLB and the WPC each produce "trap?"; the data L1 produces "hit?" and data.]

The write permission cache has to be filled
  A fill occurs at all WPC misses
  Even if the node already has write permission
  Overhead often severe
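A minimal sketch of the store check, assuming a small fully associative WPC with LRU replacement. The replacement policy, the 64-byte `LINE_SIZE`, and the names `WritePermissionCache`, `StoreTrap` and `store` are assumptions made for illustration; the slides only state that the WPC is very small, is indexed by virtual address, and is filled on every miss. The default of 16 entries matches the configuration quoted in the results slides.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per coherence unit; an assumed value


class StoreTrap(Exception):
    """Raised where the hardware would trap on a WPC miss."""


class WritePermissionCache:
    def __init__(self, entries: int = 16):
        self.entries = entries
        # Virtual line numbers known to be writable, kept in LRU order
        # (LRU is an assumption, not stated in the slides).
        self.lines = OrderedDict()

    def check_store(self, vaddr: int) -> None:
        line = vaddr // LINE_SIZE
        if line not in self.lines:
            raise StoreTrap(line)
        self.lines.move_to_end(line)  # mark as most recently used

    def fill(self, line: int) -> None:
        if len(self.lines) >= self.entries:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[line] = None


def store(wpc, memory, vaddr, value, stats):
    """Store with a WPC check; every miss traps and triggers a fill."""
    try:
        wpc.check_store(vaddr)
    except StoreTrap as trap:
        stats["traps"] += 1
        # A real handler would first obtain write permission from the
        # coherence protocol if the node does not already hold it.
        wpc.fill(trap.args[0])
    memory[vaddr] = value
```

Note that a line evicted from the WPC traps again on its next store even though the node may still hold write permission, which is the fill overhead the slide refers to.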


Simulator and Benchmarks

Simics: full-system simulator

Vasa: timing- and memory-model extension
  Cycle-accurate, Power5-like SMT processor model
  Latency and bandwidth of caches, memory and network

SPLASH-2 benchmarks


System Parameters

Scaled-down Power5 chip
  1 or 2 processor cores per chip
  2 SMT threads per processor core
  Write-through L1
  Write-back L2 and L3
    • L2 on-die, L3 tags on-die

The HW distributed shared memory system
  Directory: fully mapped bit vector, dedicated SRAM
  Coherence protocol: HW, highly optimized, non-blocking

The TMA Lite system
  Directory: fully mapped bit vector, in ordinary DRAM memory
  Coherence protocol
    • SW
    • Binary patch to Solaris modifies the trap vector
    • Coherence protocol run on the hardware thread that caused the miss
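A fully mapped bit-vector directory keeps one presence bit per node for every coherence unit. The sketch below shows that data structure only; it is not the TMA Lite protocol, and the class and method names (`BitVectorDirectory`, `add_sharer`, `make_exclusive`) are hypothetical. In TMA Lite the directory would live in ordinary DRAM and be manipulated by the software protocol; here a Python dict stands in for that memory.

```python
class BitVectorDirectory:
    """One presence bit per node for each coherence unit (line)."""

    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.sharers = {}  # line number -> integer bitmask of sharing nodes

    def add_sharer(self, line: int, node: int) -> None:
        # Set the presence bit for `node`.
        self.sharers[line] = self.sharers.get(line, 0) | (1 << node)

    def sharer_list(self, line: int) -> list:
        mask = self.sharers.get(line, 0)
        return [n for n in range(self.num_nodes) if (mask >> n) & 1]

    def make_exclusive(self, line: int, node: int) -> list:
        """Leave `node` as the only holder; return nodes to invalidate."""
        to_invalidate = [n for n in self.sharer_list(line) if n != node]
        self.sharers[line] = 1 << node
        return to_invalidate
```

For the 4-node configuration used in the evaluation, each entry needs only four presence bits.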


Execution Time Breakdown

Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16-entry WPC.


Coherence Protocol Breakdown


SW Flexibility: Coherence Unit Size

Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16-entry WPC.


Related Work

SW only
  Page-based systems
    • IVY, Munin, Cashmere, GeNIMA, TreadMarks + many more
    • Virtual memory used for coherence detection
  Fine-grained systems
    • Shasta, Blizzard, Sirocco, DSZOOM
    • Coherence checks instrumented into applications

HW support + software protocol
  FLASH, Typhoon, S3.mp
    • Coherence processor executes the coherence protocol
  SMTp
    • SMT thread executes the coherence protocol


Future Work

More mature TMA implementations
  Coherence detection on physical addresses
  System (instead of application) scalability

(Proceedings figure text error: Internet pdf is OK!)

One proposal is already available as a tech. report
  Available at: http://www.it.uu.se/research/publications/reports/2006-031
  New coherence detection scheme
    • No “false” load or store coherence misses
  A new way to decouple inter- and intra-chip coherence
  Remote access caching in DRAM memory
  Commercial applications
  Many more experiments
  Very promising results


Conclusions

Shared memory trends
  SMT and CMP
  Mid-range servers on a single chip

Trap-based Memory Architecture
  Design for commercial single-chip performance
  Simple and small HW structures for scalable shared memory

TMA Lite
  “Minimal” TMA implementation
  Competitive with HW DSM when flexibility is used
  Promising for HPC when runtime system is under control

Given the right HW/SW tradeoff, simple and efficient scalable shared memory is possible

More mature TMA arch. in next paper (the tech. report)


Questions?


The Coherence Protocol
