ICS’06 -- Håkan Zeffer: zeffer@it.uu.se -- June 30th, 2006

Håkan Zeffer

Zoran Radovic

Martin Karlsson

Erik Hagersten

Uppsala University

Sweden

TMA: A Trap-Based Memory Architecture

Simultaneous Multithreading (SMT)

Diminishing performance gains from ILP
Increased chip parallelism from hardware threading (TLP)

IBM Power5, Intel Pentium4, Sun T1 (Niagara)
“No processor should come without multiple threads” [Dr. Tremblay]

[Figure: SMT processor pipeline with a fetch unit; decode, rename, etc.; integer, floating-point, memory, and branch pipes; L1I and L1D caches]

Chip Multiprocessors (CMPs)

[Figure: a CMP with four processors (P), each with instruction (I) and data (D) caches, four L2 blocks, and an interconnect]

Piranha, IBM Power4, IBM Power5, Sun UltraSPARC IV+, Sun T1, Intel Duo, AMD Dual-Core Opteron

Multi-CMP Systems

[Figure: four CMPs (CMP 1 to CMP 4) joined by an interconnect; each CMP contains four processors (P) with I and D caches, four L2 blocks, and an on-chip interconnect]

Larger systems are sometimes built from multiple CMPs
Piranha, IBM Power4 and IBM Power5

Multi-CMP Coherence

Intra-CMP protocol for coherence within a CMP
Inter-CMP protocol for coherence between CMPs
Interactions between the protocols increase complexity

[Figure: CMP 1 to CMP 4 joined by an interconnect, with intra-CMP coherence inside each chip and inter-CMP coherence between chips]

Shared-Memory Trends

Today’s chips = yesterday’s mid-range servers
• Sun T1 has 32 hardware threads on a single die

Is it worth implementing multi-CMP systems?
• Increased development cost
• Increased verification cost
• How big is the market?

Trap-Based Memory Architectures

TMA: Trap-based Memory Architecture

Basic idea
• Optimize for commercial single-chip performance
• Let simple HW and SW support enable scalability

Coherence violation detection in hardware
Trap on inter-chip coherence violations
Solve inter-chip coherence misses in software

Outline

Introduction
TMA and TMA Lite
Evaluation methodology
Results
Related work
Future work
Conclusions

TMA Lite

TMA Lite is a “minimal” TMA implementation

Runtime system
• Deadlock avoidance
• Coherence protocol

Per-application “scalability”
Binary transparency
No memory system modifications
Simple processor core modifications
• An inter-node load coherence check
• An inter-node store coherence check

A TMA Lite System

TMA Lite nodes
• Single-chip system
• Load and store coherence check support
• HW maintains intra-chip coherence

TMA Lite cluster network
• “InfiniBand like”
• High-bandwidth
• Low-latency
• Remote memory access (put, get and atomic)

TMA Lite software
• Coherence and consistency between nodes

The Load Check

Magic value convention
• Each cache line in state invalid contains a predefined value

Hardware
• Comparator on the load path detects this value
• Trap generated when the value is found

[Figure: the loaded data is compared (=?) with the magic value register; the result is ANDed with the “load check enabled?” bit, which is controlled by system software, to produce the load trap signal]

False misses
• Occur when the magic value is used within an application
• Easy to detect and solve within the coherence protocol
• Rare
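
The load check can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the authors' hardware or protocol: the magic value constant, the class and method names, and the `invalid` set that stands in for the protocol state consulted by the trap handler are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical constant: the slides do not state the actual magic value.
MAGIC_VALUE = 0xDEADBEEFDEADBEEF


@dataclass
class LoadChecker:
    """Models the comparator on the load path."""
    magic_value: int = MAGIC_VALUE
    enabled: bool = True  # the "load check enabled?" bit, set by system software

    def load_traps(self, data: int) -> bool:
        # Trap only if the check is enabled AND the data equals the magic value.
        return self.enabled and data == self.magic_value


class Node:
    """One node; `invalid` is an assumed stand-in for software protocol state."""

    def __init__(self, checker: LoadChecker):
        self.checker = checker
        self.memory = {}      # address -> value
        self.invalid = set()  # addresses the protocol considers invalid

    def invalidate(self, addr: int) -> None:
        # Magic value convention: an invalid line holds the predefined value.
        self.memory[addr] = self.checker.magic_value
        self.invalid.add(addr)

    def load(self, addr: int, fetch_remote):
        data = self.memory[addr]
        if not self.checker.load_traps(data):
            return data, "hit"
        # Trap taken: the software handler decides what kind of miss this is.
        if addr in self.invalid:
            data = fetch_remote(addr)  # resolve the inter-node miss in software
            self.memory[addr] = data
            self.invalid.discard(addr)
            return data, "coherence miss"
        # The application itself stored the magic value: a false miss.
        return data, "false miss"


if __name__ == "__main__":
    node = Node(LoadChecker())
    node.memory[0x40] = 7
    node.invalidate(0x40)
    print(node.load(0x40, lambda addr: 99))
```

In this model a false miss costs one trap and one lookup of the protocol state, after which the loaded value is returned unchanged.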

The Store Check

Write permission cache (WPC)
• Can be seen as a very small cache
• Operates on virtual addresses
• Accessed in parallel with the data TLB
• Write permission for lines in the WPC guaranteed by protocol

[Figure: store pipeline. After address generation, the data TLB access, the WPC access, and the start of the L1 access proceed in parallel; the following stage performs the tag compare, resolves the TLB trap and WPC trap conditions, and ends the L1 access (hit?, data)]

The write permission cache has to be filled
• A fill occurs at all WPC misses
• Even if the node already has write permission
• Overhead often severe
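
The store check and its fill overhead can be sketched as a small model. This is a hedged illustration, not the authors' design: the 64-byte line size, the LRU replacement policy, and the names `WritePermissionCache`, `store`, `node_writable`, and `acquire` are assumptions made for the example. The 16-entry default matches the configuration quoted in the results slides.

```python
from collections import OrderedDict

LINE_SIZE = 64  # assumed line size in bytes; not given on the slides


class WritePermissionCache:
    """Very small cache of virtual line addresses known to be writable."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.lines = OrderedDict()  # line number -> True, kept in LRU order

    def hit(self, vaddr: int) -> bool:
        line = vaddr // LINE_SIZE
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        return False

    def fill(self, vaddr: int) -> None:
        line = vaddr // LINE_SIZE
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.entries:
            self.lines.popitem(last=False)  # evict the least recently used line

    def drop(self, vaddr: int) -> None:
        # The protocol would call this when the node loses write permission.
        self.lines.pop(vaddr // LINE_SIZE, None)


def store(wpc, node_writable, vaddr, acquire):
    """Return what happens on a store; `node_writable` is assumed protocol state."""
    if wpc.hit(vaddr):
        return "no trap"
    line = vaddr // LINE_SIZE
    if line in node_writable:
        # The node already has permission, yet the WPC miss still traps.
        outcome = "trap: fill only"
    else:
        acquire(line)  # software protocol obtains write permission
        node_writable.add(line)
        outcome = "trap: permission acquired"
    wpc.fill(vaddr)
    return outcome


if __name__ == "__main__":
    wpc, writable = WritePermissionCache(entries=2), set()
    for addr in (0, 8, 64, 128, 0):
        print(addr, store(wpc, writable, addr, lambda line: None))
```

The "trap: fill only" outcome is the case the slide calls out: a trap is taken purely to refill the WPC, which is why a small WPC can make the overhead severe.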

Simulator and Benchmarks

Simics: full-system simulator

Vasa: timing- and memory-model extension
• Cycle-accurate Power5-like SMT processor model
• Latency and bandwidth of caches, memory and network

SPLASH-2 benchmarks

System Parameters

Scaled-down Power5 chip
• 1 or 2 processor cores per chip
• 2 SMT threads per processor core
• Write-through L1
• Write-back L2 and L3 (L2 on-die, L3 tags on-die)

The HW distributed shared memory system
• Directory: fully mapped bit vector, dedicated SRAM
• Coherence protocol: HW, highly optimized, non-blocking

The TMA Lite system
• Directory: fully mapped bit vector, in ordinary DRAM memory
• Coherence protocol: SW
• Binary patch to Solaris modifies the trap vector
• Coherence protocol runs on the hardware thread that caused the miss
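
Both systems use a fully mapped bit-vector directory, meaning one presence bit per node for each coherence unit. The sketch below shows that general data structure only; it is not the TMA Lite protocol, and the class name, method names, and the invalidate-on-write policy are assumptions made for illustration. The default of four nodes matches the evaluated configuration.

```python
class DirectoryEntry:
    """Fully mapped bit vector: bit n is set when node n holds a copy."""

    def __init__(self, num_nodes: int = 4):
        self.num_nodes = num_nodes
        self.sharers = 0  # one presence bit per node

    def add_sharer(self, node: int) -> None:
        self.sharers |= 1 << node

    def sharer_list(self):
        return [n for n in range(self.num_nodes) if (self.sharers >> n) & 1]

    def make_exclusive(self, node: int):
        """Give `node` the only copy; return the nodes that must be invalidated."""
        to_invalidate = [n for n in self.sharer_list() if n != node]
        self.sharers = 1 << node
        return to_invalidate


if __name__ == "__main__":
    entry = DirectoryEntry()
    entry.add_sharer(0)
    entry.add_sharer(2)
    print(entry.make_exclusive(2), entry.sharer_list())
```

The storage cost grows with the number of nodes per entry, which is one reason where the directory lives (dedicated SRAM versus ordinary DRAM) is listed as a system parameter.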

Execution Time Breakdown

Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16-entry WPC.

Coherence Protocol Breakdown

SW Flexibility: Coherence Unit Size

Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16-entry WPC.

Related Work

SW only
Page-based systems
• IVY, Munin, Cashmere, GeNIMA, TreadMarks + many more
• Virtual memory used for coherence detection
Fine-grained systems
• Shasta, Blizzard, Sirocco, DSZOOM
• Coherence checks instrumented into applications

HW support + software protocol
FLASH, Typhoon, S3.mp
• Coherence processor executes the coherence protocol
SMTp
• SMT thread executes the coherence protocol

Future Work

More mature TMA implementations
• Coherence detection on physical addresses
• System (instead of application) scalability

(Proceedings figure text error: Internet pdf is OK!)

One proposal is already available as a tech. report
• Available at: http://www.it.uu.se/research/publications/reports/2006-031
• New coherence detection scheme: no “false” load or store coherence misses
• A new way to decouple inter- and intra-chip coherence
• Remote access caching in DRAM memory
• Commercial applications
• Many more experiments
• Very promising results

Conclusions

Shared memory trends
• SMT and CMP
• Mid-range servers on a single chip

Trap-based Memory Architecture
• Design for commercial single-chip performance
• Simple and small HW structures for scalable shared memory

TMA Lite
• “Minimal” TMA implementation
• Competitive with HW DSM when flexibility is used
• Promising for HPC when runtime system is under control

Given the right HW/SW tradeoff, simple and efficient scalable shared memory is possible

More mature TMA arch. in next paper (the tech. report)

Questions?

The Coherence Protocol