SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri
Computer Systems Laboratory
Cornell University
Mark Heinrich
School of Computer Science
University of Central Florida
Scalable multi-threading
Directory-based hardware DSM
  Directory-based coherence: complex MCs
  So complex that MCs can be programmable with embedded protocol processors
Integrated memory controllers are commonplace in high-end microprocessors
  Servers are naturally NUMA/DSM, not SMP
  Snooping is awkward and BW-limited
This talk: build directory-based scalable DSM with nominal changes to standard MC
Two major goals
Directory-based coherence without a directory controller
  still scalable
  can use less complex standard memory controllers
Flexibility in using custom protocol code or any software sequences to do “interesting things” on cache misses
  compression/encryption
  fault tolerance
Outline
  Introducing SMTp
  Basic extensions for SMTp
  Deadlock avoidance
  Evaluation methodology
  Simulation results
  Related work
  Conclusions
Introducing SMTp
SMTp: SMT with a protocol thread context
Protocol thread executes the control part of the coherence protocol in parallel with SDRAM data access
Provides flexibility to run custom software sequences on cache misses [motivation #1]
Still uses the standard MC (no directory state machine) [motivation #2]
Build large-scale directory-based DSM out of commodity nodes with integrated MC and SMTp
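Basic extensions for SMTp
[Figure: SMTp processor pipeline attached to an integrated memory controller. Labels as extracted: L2 CACHE, L2 BB, PPCV, LA, LDCTXT_ID, ICFE, IBB, DE RE, IQ, LSQ, FPQ, REGFILE, ALU, AGU, FPU, DC G, DBB. Width annotations: 1 bit, 1 bit, 7 bits. Buffer sizes: 16x64B, 16x32B, 16x128B. Paths to the memory controller: App. Miss, Protocol Miss, Uncached load/store, L1 Miss.]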
Memory controller for SMTp
[Figure: memory controller datapath. Labels as extracted: Local Miss Interface, Handler Dispatch, Network Interface (NI In, NI Out, To/From Router), ADDR., HEADER, SDRAM (App. Data, Protocol Data). Inputs from the core: PPCV, LA, LDCTXT_ID, Uncached ld/st, Protocol miss, App. miss. Other labels: NI Handler, Local Miss Handler, Miss Refill. Buffer size: 8x128B.]
Enabling a protocol thread
Statically bound to a thread context
  Need an extra thread context (PC, RAS, register map)
  No context switch
Not visible to kernel
Protocol code is provided by system (conventional DSM style)
User cannot download arbitrary code to protocol memory
Anatomy of a protocol handler
MIPS style RISC ISA
Short sequence of instructions
  Calculate directory address       // simple hash function
  Load directory entry              // normal cached load
  Compute on header and directory   // integer arithmetic
  Send cache line/control message   // uncached stores
  switch r17                        // uncached load (header)
  ldctxt r18                        // uncached load (address)
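The handler steps listed above can be sketched in Python as a toy read-request handler. This is only an illustration under stated assumptions: the hash (`directory_address`), the directory entry layout (`DirEntry`), the state and message names, the `DIR_BASE` and `ENTRY_BYTES` constants, and the `send` callback are all hypothetical. The slides state only that the directory address comes from a simple hash, that the entry is read with a normal cached load, and that messages are sent with uncached stores. The 128-byte line size is taken from the L2 configuration in the evaluated machine model.

```python
from dataclasses import dataclass, field
from typing import Optional

LINE_BYTES = 128          # L2 line size in the simulated machine model
DIR_BASE = 0x1000_0000    # hypothetical base of the directory in protocol memory
ENTRY_BYTES = 8           # hypothetical size of one directory entry


@dataclass
class DirEntry:
    """Hypothetical directory entry; the real layout is not given in the slides."""
    state: str = "UNOWNED"                 # UNOWNED, SHARED or EXCLUSIVE
    sharers: set = field(default_factory=set)
    owner: Optional[int] = None


def directory_address(line_addr: int) -> int:
    """Step 1: simple hash from a line address to its directory entry address."""
    return DIR_BASE + (line_addr // LINE_BYTES) * ENTRY_BYTES


def handle_read_request(header: dict, line_addr: int, directory: dict, send) -> DirEntry:
    """Steps 2-4: load the entry, compute on header and entry, send a message."""
    entry = directory.setdefault(directory_address(line_addr), DirEntry())
    requester = header["src"]
    if entry.state == "EXCLUSIVE" and entry.owner != requester:
        # Another node owns the line: forward the request to the owner.
        send({"type": "FORWARD_READ", "dst": entry.owner,
              "addr": line_addr, "requester": requester})
    else:
        # Memory holds a valid copy: record the sharer and reply with data.
        entry.state = "SHARED"
        entry.owner = None
        entry.sharers.add(requester)
        send({"type": "DATA_REPLY", "dst": requester, "addr": line_addr})
    return entry
```

The trailing `switch` and `ldctxt` instructions, which load the next header and address, are not modeled here.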
[Figure: "Fetching from protocol thread", shown as six animation frames. Labels as extracted: NI, LMI, Handler Dispatch, PPC, PPCV, ICFE, ADDR., HEADER, LSQ, Jump Table, Router, Front side bus, SDRAM. Step annotations across the frames: "Unblock switch", "Execute ldctxt", "(at home)".]
Fetching from protocol thread
Protocol code/data resides in unmapped portion of local SDRAM
No ITLB access
Share instruction cache with application thread(s)
Fetcher turns off PPCV after the last handler instruction is fetched
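A minimal Python sketch of the fetch control described above, assuming a jump table indexed by message type and fixed 4-byte instructions. The class `ProtocolFetchModel`, the `handler_lengths` table used to detect the last handler instruction, and the message type names are hypothetical. The slides confirm only that a jump table supplies the protocol PC (PPC), that a valid bit (PPCV) gates fetching, and that the fetcher turns PPCV off after the last handler instruction is fetched.

```python
class ProtocolFetchModel:
    """Toy model of the PPC/PPCV fetch control for the protocol thread."""

    def __init__(self, jump_table, handler_lengths):
        self.jump_table = jump_table            # message type -> handler start PC
        self.handler_lengths = handler_lengths  # message type -> instruction count (assumed)
        self.ppc = 0                            # protocol program counter
        self.ppcv = False                       # protocol PC valid bit
        self._remaining = 0

    def dispatch(self, msg_type):
        """Start a handler if none is being fetched; return True on success."""
        if self.ppcv:
            return False
        self.ppc = self.jump_table[msg_type]
        self._remaining = self.handler_lengths[msg_type]
        self.ppcv = True
        return True

    def fetch_one(self):
        """Fetch one protocol instruction; return its PC, or None when idle."""
        if not self.ppcv:
            return None
        pc = self.ppc
        self.ppc += 4                           # fixed-width instructions assumed
        self._remaining -= 1
        if self._remaining == 0:
            self.ppcv = False                   # last handler instruction fetched
        return pc
```

With PPCV clear, the fetcher in this model issues nothing for the protocol thread, which matches the idea that an idle protocol context does not compete for fetch bandwidth.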
Handling protocol load/store
No DTLB access
Share L1 data and L2 caches
L2 cache miss from protocol thread behaves differently
  Needs to bypass Local Miss Interface
  Talks to local SDRAM directly
Deadlock with shared resources
Progress of app. L2 miss depends on progress of protocol thread
Resources involved: front-end queue slots, branch stack space, integer registers, integer queue slots, LSQ slots, speculative store buffers, MSHRs, and cache index
[Figure: ROB with retire and allocate pointers. A LOAD at the retire pointer has an L2 miss that waits on the local miss handler, while the handler's protocol instruction is BLOCKED because the IQ is full.]
Solving resource deadlock
General solution: one reserved instance
  Out of 8 decode queue slots app. threads get 7 while all 8 are open to protocol thread
Easier solution: Pentium 4 style static resource partitioning
Cache index conflict:
  Solution: L1 and L2 bypass buffers (FA/LRU)
  Allocate a bypass buffer entry instead
  Parallel lookup: hit latency unchanged
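The "one reserved instance" rule can be illustrated with a small Python sketch of a shared queue. The class name `ReservedSlotQueue` and its interface are hypothetical. The 8-slot size and the 7/8 split come from the decode queue example above, and the same idea is assumed to carry over to the other shared resources.

```python
class ReservedSlotQueue:
    """Shared queue in which application threads can never take the last slot."""

    def __init__(self, total_slots=8, reserved_for_protocol=1):
        self.total = total_slots
        self.app_limit = total_slots - reserved_for_protocol
        self.app_used = 0
        self.protocol_used = 0

    def try_allocate(self, is_protocol):
        """Return True and take a slot if the requesting thread may have one."""
        if self.app_used + self.protocol_used >= self.total:
            return False                      # queue is completely full
        if is_protocol:
            self.protocol_used += 1
            return True
        if self.app_used >= self.app_limit:
            return False                      # remaining slot is reserved
        self.app_used += 1
        return True

    def release(self, is_protocol):
        """Free one slot previously taken by the given kind of thread."""
        if is_protocol:
            if self.protocol_used == 0:
                raise ValueError("no protocol slot to release")
            self.protocol_used -= 1
        else:
            if self.app_used == 0:
                raise ValueError("no application slot to release")
            self.app_used -= 1
```

Because application threads can hold at most 7 of the 8 slots, a protocol instruction always finds a slot unless the protocol thread itself holds the rest, so an application miss cannot starve the handler it depends on.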
SMTp: deadlock solution
[Figure: the SMTp pipeline diagram shown again. Labels as extracted: L2 CACHE, L2 BB, PPCV, LA, LDCTXT_ID, ICFE, IBB, DE RE, IQ, LSQ, FPQ, REGFILE, ALU, AGU, FPU, DC G, DBB. Width annotations: 1 bit, 1 bit, 7 bits. Buffer sizes: 16x64B, 16x32B, 16x128B. Paths to the memory controller: App. Miss, Protocol Miss, Uncached load/store, L1 Miss.]
Evaluation methodology
Applications
  SPLASH-2: FFT, LU, Radix, Ocean, Water
  FFTW
Simulated machine model (details in paper)
  2 GHz, 9 pipe stages
  1, 2, 4 app. threads + one protocol context
  ROB: 128 (per thread)
  Integer/floating point registers: 160/192/256
  L1 Icache: 32 KB/64B/2-way/LRU/1 cycle
  L1 Dcache: 32 KB/32B/2-way/LRU/1 cycle
  Unified L2: 2 MB/128B/8-way/LRU/9 cycles
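As a worked check of the cache parameters above, the short Python sketch below derives the number of sets and index bits for each cache from its size, line size, and associativity. The helper names are hypothetical, and KB and MB are assumed to be binary units (1024 bytes and 1024 KB).

```python
KB = 1024
MB = 1024 * KB


def cache_sets(size_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache."""
    return size_bytes // line_bytes // ways


def index_bits(sets):
    """Number of address bits used to select a set (sets must be a power of two)."""
    assert sets > 0 and sets & (sets - 1) == 0
    return sets.bit_length() - 1


# (size, line size, associativity) as listed for the simulated machine
CACHES = {
    "L1 Icache": (32 * KB, 64, 2),
    "L1 Dcache": (32 * KB, 32, 2),
    "Unified L2": (2 * MB, 128, 8),
}

for name, (size, line, ways) in CACHES.items():
    sets = cache_sets(size, line, ways)
    print(f"{name}: {sets} sets, {index_bits(sets)} index bits")
```

The set counts matter for the cache index conflict case: an application line and a protocol line that map to the same set compete for the same ways.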
Simulated machine models
Model       MC        PP       MC, PP frequency  Protocol D$
Base        Non-int.  2-issue  400 MHz           512 KB DM
IntPerfect  Int.      2-issue  Proc. core        Perfect
Int512KB    Int.      2-issue  ½ core            512 KB DM
Int64KB     Int.      2-issue  ½ core            64 KB DM
SMTp        Int.      None     ½ core            None
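The five models can be written down as a small Python configuration table. This sketch assumes that "Proc. core" means the full 2 GHz core clock from the machine model and that "½ core" means half of it; the dictionary layout and the helper `core_cycles_per_mc_cycle` are hypothetical. For SMTp, which has no protocol processor, the frequency is taken to apply to the memory controller alone.

```python
CORE_HZ = 2_000_000_000  # 2 GHz core clock from the simulated machine model

MODELS = {
    "Base":       {"integrated_mc": False, "pp": "2-issue", "freq_hz": 400_000_000,  "protocol_dcache": "512 KB DM"},
    "IntPerfect": {"integrated_mc": True,  "pp": "2-issue", "freq_hz": CORE_HZ,      "protocol_dcache": "perfect"},
    "Int512KB":   {"integrated_mc": True,  "pp": "2-issue", "freq_hz": CORE_HZ // 2, "protocol_dcache": "512 KB DM"},
    "Int64KB":    {"integrated_mc": True,  "pp": "2-issue", "freq_hz": CORE_HZ // 2, "protocol_dcache": "64 KB DM"},
    "SMTp":       {"integrated_mc": True,  "pp": None,      "freq_hz": CORE_HZ // 2, "protocol_dcache": None},
}


def core_cycles_per_mc_cycle(model):
    """How many core cycles elapse during one memory controller cycle."""
    return CORE_HZ / MODELS[model]["freq_hz"]


for name in MODELS:
    print(f"{name}: {core_cycles_per_mc_cycle(name):g} core cycles per MC cycle")
```

Under these assumptions the IntPerfect controller runs at twice the clock of the other integrated models.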
Single node (1app,1prot) results
Single node (2app,1prot) results
Single node results: summary
Memory controller integration helps
  Ocean and FFTW get maximum benefit
  LU and Water are largely insensitive
SMTp is always faster than Base
SMTp performs on par with Int512KB
  In a few cases Int512KB outperforms SMTp by at most 1.6%
Int64KB suffers from directory cache misses
  FFTW and Radix-Sort are most sensitive
32-node (1app,1prot) results
32-node (2app,1prot) results
Multi-node results: summary
With increasing system size integrated models converge in terms of performance
IntPerfect gets a slight edge due to double memory controller speed
SMTp continues to deliver excellent performance
The gap between Int512KB and SMTp: at most 6%, the same on average
Resource occupancy: summary
Protocol thread is active for a very small amount of time (low protocol occupancy)
When active, can have high peak resource occupancy
When idle, all resources are freed except
  31 mapped registers
  2 LSQ slots holding switch and ldctxt
Overall, protocol thread has very low pipeline overhead
Related work
Simultaneous multi-threading
  Assisted execution [HPCA’01][MICRO’01][ISCA’02]
  Fault tolerance [ASPLOS’00][ISCA’02]
  User-level message passing [MTEAC’01]
Programmable protocol engine
  Customized co-processor (FLASH, S3.mp, STiNG, Piranha)
  Commodity off-the-shelf processor (Typhoon)
  On main processor through low overhead interrupt (Chalmers) [ISCA’95]
Conclusions
First design to exploit SMT to run directory-based coherence protocol on spare threads
Delivers performance close to (within 6%, average 0%) integrated coherence controllers with large (512 KB) stand-alone directory data caches
Extremely low pipeline overhead
SMTp provides an opportunity to build scalable directory-based DSMs with minor changes to commodity nodes
Future directions
Need not be restricted to building DSMs out of commodity nodes only
Use SMTp to carry out
  On-the-fly compression/encryption of L2 cache lines
  Software controlled address remapping to improve locality of cache access
  Fault tolerance by selectively extending coherence protocols
Alternate CMP design
Issues with multiple protocol threads
Protocol occupancy
16 nodes, (1a,1p) threads per node
Protocol thread characteristics
16 nodes, (1a,1p) threads per node