SMTp: An Architecture for Next-generation Scalable Multi-threading

Mainak Chaudhuri
Computer Systems Laboratory
Cornell University

Mark Heinrich
School of Computer Science
University of Central Florida

Scalable multi-threading

- Directory-based hardware DSM
  - Directory-based coherence: complex MCs
  - So complex that MCs can be programmable with embedded protocol processors
- Integrated memory controllers are commonplace in high-end microprocessors
  - Servers are naturally NUMA/DSM, not SMP
  - Snooping is awkward and BW-limited
- This talk: build directory-based scalable DSM with nominal changes to standard MC

Two major goals

- Directory-based coherence without a directory controller
  - still scalable
  - can use less complex standard memory controllers
- Flexibility in using custom protocol code or any software sequences to do “interesting things” on cache misses
  - compression/encryption
  - fault tolerance

Outline

- Introducing SMTp
- Basic extensions for SMTp
- Deadlock avoidance
- Evaluation methodology
- Simulation results
- Related work
- Conclusions

Introducing SMTp

- SMTp: SMT with a protocol thread context
- Protocol thread executes the control part of the coherence protocol in parallel with SDRAM data access
- Provides flexibility to run custom software sequences on cache misses [motivation #1]
- Still uses the standard MC (no directory state machine) [motivation #2]
- Build large-scale directory-based DSM out of commodity nodes with integrated MC and SMTp


Basic extensions for SMTp

[Figure: SMTp pipeline with integrated memory controller. Block labels: INTEGRATED MEMORY CONTROLLER, L2 CACHE, L2 BB, PPCV, LA, LDCTXT_ID, ICFE, IBB, DE, RE, IQ, LSQ, FPQ, REGFILE, ALU, AGU, FPU, DC, G, DBB. Size annotations: 1 bit, 1 bit, 7 bits; 16x64B, 16x32B, 16x128B. Path labels: App. Miss, Protocol Miss, Uncached load/store, L1 Miss.]

Memory controller for SMTp

[Figure: memory controller block diagram. Labels: LOCAL MISS INTERFACE, HANDLER DISPATCH, NETWORK INTERFACE, ADDR., HEADER, SDRAM (APP. DATA, PROTOCOL DATA), PPCV, LA, LDCTXT_ID, Uncached ld/st, Protocol miss, App. miss, To/From Router, NI Handler, Local Miss Handler, Miss Refill, NI In, NI Out, 8x128B.]

Enabling a protocol thread

- Statically bound to a thread context
  - Need an extra thread context (PC, RAS, register map)
  - No context switch
- Not visible to kernel
- Protocol code is provided by system (conventional DSM style)
- User cannot download arbitrary code to protocol memory

Anatomy of a protocol handler

- MIPS style RISC ISA
- Short sequence of instructions

    Calculate directory address        // simple hash function
    Load directory entry               // normal cached load
    Compute on header and directory    // integer arithmetic
    Send cache line/control message    // uncached stores
    switch r17                         // uncached load (header)
    ldctxt r18                         // uncached load (address)
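
The handler steps on this slide can be sketched as a small software model. This is only an illustration under stated assumptions: the directory entry format (a sharer bit vector plus a dirty flag), the constants `DIR_BASE` and `DIR_ENTRY_BYTES`, the message names, and the class `HomeNode` are all hypothetical and do not come from the talk. The real handlers are short MIPS-style instruction sequences, not Python.

```python
from dataclasses import dataclass, field

LINE_BYTES = 128        # L2 line size from the simulated machine model
DIR_BASE = 0x8000_0000  # assumed base of the directory in protocol memory
DIR_ENTRY_BYTES = 8     # assumed size of one directory entry


@dataclass
class DirEntry:
    sharers: int = 0     # assumed format: one bit per sharing node
    dirty: bool = False  # assumed: set when exactly one node owns the line


@dataclass
class HomeNode:
    directory: dict = field(default_factory=dict)  # directory address -> DirEntry
    outbox: list = field(default_factory=list)     # stands in for uncached stores

    def dir_addr(self, addr: int) -> int:
        # Step 1: calculate the directory address from the line address.
        return DIR_BASE + (addr // LINE_BYTES) * DIR_ENTRY_BYTES

    def handle_read(self, addr: int, requester: int) -> None:
        # Step 2: load the directory entry (a normal cached load in SMTp).
        entry = self.directory.setdefault(self.dir_addr(addr), DirEntry())
        # Step 3: compute on the header fields and the directory entry.
        if entry.dirty:
            owner = entry.sharers.bit_length() - 1
            message = ("FORWARD", owner, addr, requester)
            entry.dirty = False
        else:
            message = ("DATA_REPLY", requester, addr)
        entry.sharers |= 1 << requester
        # Step 4: send the cache line or control message.
        self.outbox.append(message)
        # The real handler would end with switch and ldctxt; not modeled here.
```

For example, a read of a clean line by node 3 produces a single `DATA_REPLY` and sets bit 3 of the sharer vector, while a read of a line marked dirty produces a `FORWARD` to the current owner.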

Fetching from protocol thread

[Figure, shown as six animation frames: handler dispatch and protocol fetch path. Labels repeated in each frame: NI, LMI, HANDLER DISPATCH, PPC, PPCV, ICFE, ADDR., HEADER, LSQ, JUMP TABLE, SDRAM. The first frame also shows Router and Front side bus. Frame annotations: "Unblock switch", "Execute ldctxt", "(at home)".]

- Protocol code/data resides in unmapped portion of local SDRAM
- No ITLB access
- Share instruction cache with application thread(s)
- Fetcher turns off PPCV after the last handler instruction is fetched
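
A toy model of the fetch behavior described on this slide may help. The names PPC, PPCV, and jump table come from the figure labels. The message types, handler addresses, handler lengths, 4-byte instruction size, and one-instruction-per-step fetch are assumptions made for illustration only.

```python
from collections import deque

# Hypothetical jump table: message type -> (handler start PC, instruction count).
JUMP_TABLE = {
    "LOCAL_READ": (0x100, 4),
    "REMOTE_READ": (0x200, 6),
}


class ProtocolFetch:
    def __init__(self):
        self.pending = deque()  # message types waiting at the NI or LMI
        self.ppc = 0            # protocol program counter
        self.ppcv = False       # PPC valid: protocol fetch happens only when set
        self.end_pc = 0         # PC of the last instruction of the current handler

    def dispatch(self):
        """Start the next handler if the protocol thread is idle."""
        if self.ppcv or not self.pending:
            return
        start, length = JUMP_TABLE[self.pending.popleft()]
        self.ppc = start
        self.end_pc = start + 4 * (length - 1)
        self.ppcv = True

    def fetch(self):
        """Fetch one protocol instruction; return its PC, or None when idle."""
        if not self.ppcv:
            return None
        pc = self.ppc
        if pc == self.end_pc:
            self.ppcv = False   # last handler instruction fetched: turn PPCV off
        else:
            self.ppc += 4
        return pc
```

With PPCV off the fetcher never selects the protocol thread, which is consistent with the later claim that an idle protocol thread holds almost no pipeline resources.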

Handling protocol load/store

- No DTLB access
- Share L1 data and L2 caches
- L2 cache miss from protocol thread behaves differently
  - Needs to bypass Local Miss Interface
  - Talks to local SDRAM directly
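
The routing rule for L2 misses can be written down in a few lines. The function and enum names are hypothetical. The comment on why the bypass is needed is one plausible reading of the slide (a protocol miss that required a protocol handler would depend on itself), not a statement taken from the talk.

```python
from enum import Enum


class Source(Enum):
    APP = "application thread"
    PROTOCOL = "protocol thread"


def route_l2_miss(source: Source) -> str:
    """Return the memory-controller entry point for an L2 miss."""
    if source is Source.PROTOCOL:
        # Protocol code and data live in local SDRAM, so the miss goes
        # straight to SDRAM and no coherence handler is dispatched for it.
        return "SDRAM"
    # An application miss enters the local miss interface, which leads to
    # a protocol handler running on the protocol thread.
    return "LOCAL_MISS_INTERFACE"
```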





Deadlock with shared resources

- Progress of app. L2 miss depends on progress of protocol thread
- Resources involved: front-end queue slots, branch stack space, integer registers, integer queue slots, LSQ slots, speculative store buffers, MSHRs, and cache index

[Figure: ROB with retire and allocate pointers. A LOAD at the retire pointer has an L2 miss; the protocol instruction of its local miss handler is BLOCKED because the IQ is full.]

Solving resource deadlock

- General solution: one reserved instance
- Out of 8 decode queue slots app. threads get 7 while all 8 are open to protocol thread
- Easier solution: Pentium 4 style static resource partitioning
- Cache index conflict:
  - Solution: L1 and L2 bypass buffers (FA/LRU)
  - Allocate a bypass buffer entry instead
  - Parallel lookup: hit latency unchanged
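
The "one reserved instance" rule can be sketched as an allocator for the 8-entry decode queue from the slide. The class and method names are hypothetical, and counting application and protocol entries separately is an assumption about how the limit is enforced.

```python
class DecodeQueue:
    """Shared queue in which application threads may never take the last slot."""

    def __init__(self, slots: int = 8, reserved: int = 1):
        self.slots = slots
        self.reserved = reserved
        self.app_used = 0
        self.prot_used = 0

    def try_allocate(self, is_protocol: bool) -> bool:
        if self.app_used + self.prot_used >= self.slots:
            return False  # queue is completely full
        if not is_protocol and self.app_used >= self.slots - self.reserved:
            return False  # application threads are capped at 7 of 8 slots
        if is_protocol:
            self.prot_used += 1
        else:
            self.app_used += 1
        return True

    def release(self, is_protocol: bool) -> None:
        if is_protocol:
            self.prot_used -= 1
        else:
            self.app_used -= 1
```

Because application threads can hold at most 7 slots, a protocol instruction can always enter the queue even when every application instruction is stalled behind an L2 miss, which breaks the dependence cycle described on the previous slide.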

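SMTp: deadlock solution

[Figure: SMTp pipeline with integrated memory controller and bypass buffers. Block labels: INTEGRATED MEMORY CONTROLLER, L2 CACHE, L2 BB, PPCV, LA, LDCTXT_ID, ICFE, IBB, DE, RE, IQ, LSQ, FPQ, REGFILE, ALU, AGU, FPU, DC, G, DBB. Size annotations: 1 bit, 1 bit, 7 bits; 16x64B, 16x32B, 16x128B. Path labels: App. Miss, Protocol Miss, Uncached load/store, L1 Miss.]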
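
A bypass buffer of the kind proposed for cache index conflicts (fully associative, LRU) can be modeled as below. The default of 16 entries is read off the 16x... size annotations in the figure; which annotation belongs to which buffer is an assumption, as are all names in the code. Writeback of evicted lines is not modeled.

```python
from collections import OrderedDict


class BypassBuffer:
    """Fully associative buffer with LRU replacement."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.lines = OrderedDict()  # line address -> data, least recent first

    def lookup(self, line_addr):
        if line_addr in self.lines:
            self.lines.move_to_end(line_addr)  # mark as most recently used
            return self.lines[line_addr]
        return None

    def allocate(self, line_addr, data):
        if line_addr in self.lines:
            self.lines.move_to_end(line_addr)
        elif len(self.lines) >= self.entries:
            self.lines.popitem(last=False)     # evict the least recently used line
        self.lines[line_addr] = data


def cached_lookup(cache: dict, bypass: BypassBuffer, line_addr):
    """Stand-in for the parallel lookup: a hit in either structure returns data."""
    hit = cache.get(line_addr)
    return hit if hit is not None else bypass.lookup(line_addr)
```

In hardware both structures are probed in the same cycle, which is why the slide can claim that hit latency is unchanged; the sequential check here only models the result, not the timing.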





Evaluation methodology

- Applications
  - SPLASH-2: FFT, LU, Radix, Ocean, Water
  - FFTW
- Simulated machine model (details in paper)
  - 2 GHz, 9 pipe stages
  - 1, 2, 4 app. threads + one protocol context
  - ROB: 128 (per thread)
  - Integer/floating point registers: 160/192/256
  - L1 Icache: 32 KB/64B/2-way/LRU/1 cycle
  - L1 Dcache: 32 KB/32B/2-way/LRU/1 cycle
  - Unified L2: 2 MB/128B/8-way/LRU/9 cycles
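
As a quick check on the cache parameters above, the number of sets in each cache follows from size / (line size x associativity). This matters for the cache index conflicts discussed earlier, since the set count fixes how many index values exist. The helper name `num_sets` is illustrative.

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache."""
    return size_bytes // (line_bytes * ways)


# (capacity, line size, associativity) as listed on the slide
CACHES = {
    "L1 Icache": (32 * KB, 64, 2),
    "L1 Dcache": (32 * KB, 32, 2),
    "Unified L2": (2 * MB, 128, 8),
}

SETS = {name: num_sets(*geometry) for name, geometry in CACHES.items()}
# SETS == {"L1 Icache": 256, "L1 Dcache": 512, "Unified L2": 2048}
```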

Simulated machine models

Model       MC        PP       MC, PP frequency  Protocol D$
Base        Non-int.  2-issue  400 MHz           512 KB DM
IntPerfect  Int.      2-issue  Proc. core        Perfect
Int512KB    Int.      2-issue  ½ core            512 KB DM
Int64KB     Int.      2-issue  ½ core            64 KB DM
SMTp        Int.      None     ½ core            None




Single node (1app,1prot) results

Single node (2app,1prot) results

Single node results: summary

- Memory controller integration helps
- Ocean and FFTW get maximum benefit
- LU and Water are largely insensitive
- SMTp is always faster than Base
- SMTp performs on par with Int512KB
- In a few cases Int512KB outperforms SMTp by at most 1.6%
- Int64KB suffers from directory cache misses
  - FFTW and Radix-Sort are most sensitive

32-node (1app,1prot) results

32-node (2app,1prot) results

Multi-node results: summary

- With increasing system size, integrated models converge in terms of performance
- IntPerfect gets a slight edge due to double memory controller speed
- SMTp continues to deliver excellent performance
- The gap between Int512KB and SMTp: at most 6%, and the same on average

Resource occupancy: summary

- Protocol thread is active for a very small amount of time (low protocol occupancy)
- When active, it can have high peak resource occupancy
- When idle, all resources are freed except
  - 31 mapped registers
  - 2 LSQ slots holding switch and ldctxt
- Overall, protocol thread has very low pipeline overhead



Related work

- Simultaneous multi-threading
  - Assisted execution [HPCA’01][MICRO’01][ISCA’02]
  - Fault tolerance [ASPLOS’00][ISCA’02]
  - User-level message passing [MTEAC’01]
- Programmable protocol engine
  - Customized co-processor (FLASH, S3.mp, STiNG, Piranha)
  - Commodity off-the-shelf processor (Typhoon)
  - On main processor through low overhead interrupt (Chalmers) [ISCA’95]


Conclusions

- First design to exploit SMT to run directory-based coherence protocol on spare threads
- Delivers performance close to (within 6%, average 0%) integrated coherence controllers with large (512 KB) stand-alone directory data caches
- Extremely low pipeline overhead
- SMTp provides an opportunity to build scalable directory-based DSMs with minor changes to commodity nodes

Future directions

- Need not be restricted to building DSMs out of commodity nodes only
- Use SMTp to carry out
  - On-the-fly compression/encryption of L2 cache lines
  - Software controlled address remapping to improve locality of cache access
  - Fault tolerance by selectively extending coherence protocols
- Alternate CMP design
  - Issues with multiple protocol threads


Protocol occupancy

16 nodes, (1a,1p) threads per node

Protocol thread characteristics

16 nodes, (1a,1p) threads per node
