ICS’06 -- Håkan Zeffer: [email protected] June 30th, 2006
Håkan Zeffer
Zoran Radovic
Martin Karlsson
Erik Hagersten
Uppsala University
Sweden
TMA: A Trap-Based Memory Architecture
Simultaneous Multithreading (SMT)
Diminishing performance from ILP
Increased chip parallelism from hardware threading (TLP)
• IBM Power5, Intel Pentium4, Sun T1 (Niagara)
• “No processor should come without multiple threads” [Dr. Tremblay]
[Figure: SMT pipeline with a fetch unit; decode, rename etc.; integer, floating-point, memory and branch pipes; L1I and L1D caches]
Chip Multiprocessors (CMPs)
[Figure: a CMP with four processors (P), each with L1 instruction and data caches (I, D), four L2 caches and an interconnect]
Piranha, IBM Power4, IBM Power5, Sun UltraSPARC IV+, Sun T1, Intel Duo, AMD Dual-Core Opteron
Multi-CMP Systems
[Figure: CMP 1 to CMP 4 joined by an interconnect; each CMP contains four processors (P) with L1 instruction and data caches (I, D), four L2 caches and an on-chip interconnect]
Larger systems sometimes built from multiple CMPs
• Piranha, IBM Power4 and IBM Power5
Multi-CMP Coherence
Intra-CMP protocol for coherence within a CMP
Inter-CMP protocol for coherence between CMPs
Interactions between the protocols increase complexity
[Figure: CMP 1 to CMP 4 joined by an interconnect, labelled with intra-CMP coherence and inter-CMP coherence]
Shared-Memory Trends
Today’s chips = yesterday’s mid-range servers
• Sun T1 has 32 hardware threads on a single die
Is it worthwhile to implement multi-CMP systems?
• Increased development cost
• Increased verification cost
• How big is the market?
Trap-Based Memory Architectures
TMA: Trap-based Memory Architecture
Basic idea
• Optimize for commercial single-chip performance
• Let simple HW and SW support enable scalability
Coherence violation detection in hardware
Trap on inter-chip coherence violations
Solve inter-chip coherence misses in software
Outline
Introduction
TMA and TMA Lite
Evaluation methodology
Results
Related work
Future work
Conclusions
TMA Lite
TMA Lite is a “minimal” TMA implementation
Runtime system
• Deadlock avoidance
• Coherence protocol
Per-application “scalability”
Binary transparency
No memory system modifications
Simple processor core modifications
• An inter-node load coherence check
• An inter-node store coherence check
A TMA Lite System
TMA Lite nodes
• Single-chip system
• Load and store coherence check support
• HW maintains intra-chip coherence
TMA Lite cluster network
• “InfiniBand like”
• High-bandwidth
• Low-latency
• Remote memory access (put, get and atomic)
TMA Lite software
• Coherence and consistency between nodes
The Load Check
Magic value convention
• Each cache line in state invalid contains a predefined value
Hardware
• Comparator at the load path detects this value
• Trap generated when the value is found
[Figure: load data is compared (=?) with the magic value register; the result is ANDed (&) with "load check enabled?" to produce "load trap?"; annotated "Controlled by system software"]
False misses
• Occur when the magic value is used within an application
• Easy to detect and solve within the coherence protocol
• Rare
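The load check can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware or the TMA Lite runtime: the magic constant, the `Node` class, the `invalid` set (standing in for whatever state the software protocol keeps) and the `fetch_remote` callback are all hypothetical names introduced for illustration.

```python
MAGIC_VALUE = 0xDEADBEEFDEADBEEF  # hypothetical; the slides do not state the value


class LoadTrap(Exception):
    """Stands in for the hardware trap raised by the load-path comparator."""


class Node:
    def __init__(self, magic=MAGIC_VALUE, load_check_enabled=True):
        self.magic = magic                    # "magic value register"
        self.load_check_enabled = load_check_enabled
        self.memory = {}                      # addr -> value (local copy)
        self.invalid = set()                  # assumed protocol state: invalid addrs

    def invalidate(self, addr):
        # Magic value convention: invalid data is overwritten with the magic value.
        self.memory[addr] = self.magic
        self.invalid.add(addr)

    def install(self, addr, value):
        # Place valid data in local memory (including application stores).
        self.memory[addr] = value
        self.invalid.discard(addr)

    def hw_load(self, addr):
        # Comparator on the load path: trap if enabled and data == magic.
        data = self.memory[addr]
        if self.load_check_enabled and data == self.magic:
            raise LoadTrap(addr)
        return data

    def load(self, addr, fetch_remote):
        # Software trap handler: resolve true misses, filter out false misses.
        try:
            return self.hw_load(addr)
        except LoadTrap:
            if addr not in self.invalid:
                # False miss: the application really stored the magic value.
                return self.magic
            value = fetch_remote(addr)        # assumed remote "get"
            self.install(addr, value)
            return value
```

In this model a load of valid data never enters the handler; only loads that return the magic value pay the software cost, and a false miss is resolved by a single lookup in the protocol state.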
The Store Check
Write permission cache (WPC)
• Can be seen as a very small cache
• Operates on virtual addresses
• Accessed in parallel with the data TLB
• Write permission for lines in the WPC guaranteed by protocol
[Figure: pipeline timing. Stage 1: address generation. Stage 2: TLB access, WPC access, start L1 access. Stage 3: tag compare, TLB trap?, WPC trap?, end L1 access. The data TLB and the WPC each produce a "trap?" signal; the data L1 produces "hit?" and data]
The write permission cache has to be filled
• A fill occurs at all WPC misses
• Even if the node already has write permission
• Overhead often severe
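The store check can be modelled in the same spirit. The sketch below is an assumption-laden illustration rather than the actual design: the 64-byte line size, the LRU replacement policy, and the names `WritePermissionCache`, `StoreTrap`, `store` and `acquire_write_permission` are all hypothetical. Only the 16-entry default comes from the evaluated configuration.

```python
from collections import OrderedDict

LINE_SIZE = 64  # assumed coherence-unit size in bytes


class StoreTrap(Exception):
    """Stands in for the trap raised on a WPC miss."""


class WritePermissionCache:
    def __init__(self, entries=16):
        self.entries = entries
        self.lines = OrderedDict()  # virtual line number -> True, in LRU order

    def _line(self, vaddr):
        return vaddr // LINE_SIZE

    def check_store(self, vaddr):
        # Looked up with the virtual address, conceptually alongside the TLB.
        line = self._line(vaddr)
        if line not in self.lines:
            raise StoreTrap(vaddr)
        self.lines.move_to_end(line)

    def fill(self, vaddr):
        # Called by the protocol once write permission is guaranteed.
        line = self._line(vaddr)
        self.lines[line] = True
        self.lines.move_to_end(line)
        while len(self.lines) > self.entries:
            self.lines.popitem(last=False)  # evict least recently used (assumed)

    def revoke(self, vaddr):
        # Called when the protocol gives write permission away.
        self.lines.pop(self._line(vaddr), None)


def store(wpc, memory, vaddr, value, acquire_write_permission):
    try:
        wpc.check_store(vaddr)
    except StoreTrap:
        # Every WPC miss traps, even if the node already holds write
        # permission; in that case the handler only has to refill the WPC.
        acquire_write_permission(vaddr)
        wpc.fill(vaddr)
    memory[vaddr] = value
```

Because the structure is tiny, a store stream that touches more lines than the WPC holds keeps re-entering the handler, which is the fill overhead the slide refers to.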
Simulator and Benchmarks
Simics: full-system simulator
Vasa: timing- and memory-model extension
• Cycle-accurate, Power5-like SMT processor model
• Latency and bandwidth of caches, memory and network
SPLASH-2 benchmarks
System Parameters
Scaled-down Power5 chip
• 1 or 2 processor cores per chip
• 2 SMT threads per processor core
• Write-through L1
• Write-back L2 and L3
• L2 on-die, L3 tags on-die
The HW distributed shared memory system
• Directory: fully mapped bit vector, dedicated SRAM
• Coherence protocol: HW, highly optimized, non-blocking
The TMA Lite system
• Directory: fully mapped bit vector, in ordinary DRAM memory
• Coherence protocol: SW
• Binary patch to Solaris modifies the trap vector
• Coherence protocol runs on the hardware thread that caused the miss
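A fully mapped bit-vector directory keeps one presence bit per node for every coherence unit. The slides do not describe the protocol states or messages, so the following is a generic invalidation-based sketch, not the TMA Lite protocol; the `Directory` class and its `read_miss` and `write_miss` methods are hypothetical.

```python
class Directory:
    """Generic full-map directory: one presence bit per node and line."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}  # line -> int used as a bit vector of nodes with a copy
        self.owner = {}    # line -> node currently holding write permission

    def sharer_list(self, line):
        bits = self.sharers.get(line, 0)
        return [n for n in range(self.num_nodes) if (bits >> n) & 1]

    def read_miss(self, line, node):
        # Returns the previous writer that must be downgraded, or None.
        prev = self.owner.pop(line, None)
        self.sharers[line] = self.sharers.get(line, 0) | (1 << node)
        return prev if prev != node else None

    def write_miss(self, line, node):
        # Returns the nodes whose copies must be invalidated.
        victims = [n for n in self.sharer_list(line) if n != node]
        self.sharers[line] = 1 << node
        self.owner[line] = node
        return victims


# Usage with the 4-node configuration from the evaluation:
directory = Directory(num_nodes=4)
directory.read_miss(line=0, node=1)
directory.read_miss(line=0, node=2)
to_invalidate = directory.write_miss(line=0, node=3)  # nodes 1 and 2
```

With only four nodes the vector fits in a few bits per line, so keeping it in ordinary DRAM mainly costs access latency rather than capacity.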
Execution Time Breakdown
Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16-entry WPC.
Coherence Protocol Breakdown
SW Flexibility: Coherence Unit Size
Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16-entry WPC.
Related Work
SW only
Page-based systems
• IVY, Munin, Cashmere, GeNIMA, TreadMarks + many more
• Virtual memory used for coherence detection
Fine-grained systems
• Shasta, Blizzard, Sirocco, DSZOOM
• Coherence checks instrumented into applications
HW support + software protocol
FLASH, Typhoon, S3.mp
• Coherence processor executes the coherence protocol
SMTp
• SMT thread executes the coherence protocol
Future Work
More mature TMA implementations
• Coherence detection on physical addresses
• System (instead of application) scalability
(Proceedings figure text error: Internet pdf is OK!)
One proposal is already available as a tech. report
• Available at: http://www.it.uu.se/research/publications/reports/2006-031
• New coherence detection scheme
• No “false” load or store coherence misses
• A new way to decouple inter- and intra-chip coherence
• Remote access caching in DRAM memory
• Commercial applications
• Many more experiments
• Very promising results
Conclusions
Shared memory trends
• SMT and CMP
• Mid-range servers on a single chip
Trap-based Memory Architecture
• Design for commercial single-chip performance
• Simple and small HW structures for scalable shared memory
TMA Lite
• “Minimal” TMA implementation
• Competitive with HW DSM when flexibility is used
• Promising for HPC when runtime system is under control
Given the right HW/SW tradeoff, simple and efficient scalable shared memory is possible
More mature TMA arch. in next paper (the tech. report)
Questions?
The Coherence Protocol