Simultaneous Multithreading – p. 1
Simultaneous Multithreading
Esen VAROL
YEDİTEPE UNIVERSITY
Contents
- Advances in Technology
- Types of Parallelism
- Simultaneous Multithreading – the Idea
- Comparison of Parallel Processors
- Simultaneous Multithreading Model
- Results
- Simultaneous Multithreading Issues
- Commercial Aspects
- References
- Conclusion
A Billion Transistors, Possibilities?
- Add more memory
  - Increase on-chip cache/primary memory
- Increase system integration
  - Add I/O controllers, graphics accelerators
- Enhance computational capability
  - Increase parallelism in all forms
Types of Parallelism
- Instruction-level parallelism
  - Pipelining
  - Superscalar
  - Very Long Instruction Word (VLIW)
- Application-level parallelism
  - Parallel programming
  - Multiple threads
  - Multiple processes
Superscalar
- Issue multiple instructions in each cycle
- Multiple issues are not due to pipelining
- Several functional units of the same type, e.g. ALUs
- The dispatcher reads instructions and decides which can run in parallel
- In VLIW, the dispatcher's complexity is moved to the compiler
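The dispatcher's dependence check can be sketched as follows (a toy model: instructions are hypothetical `(dest, src1, src2)` register triples, and only a read-after-write hazard between two adjacent instructions is checked):

```python
# Toy superscalar dispatch check: two adjacent instructions can issue
# in the same cycle only if the second does not read the register the
# first one writes (no read-after-write hazard between them).

def can_dual_issue(i1, i2):
    """i1, i2 are (dest, src1, src2) register-name triples."""
    dest1, _, _ = i1
    _, src1, src2 = i2
    return dest1 != src1 and dest1 != src2

print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r5", "r6")))  # True: independent
print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r1", "r6")))  # False: RAW hazard on r1
```

In a VLIW machine this same check runs at compile time instead of in dispatch hardware.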
Multithreaded Processors
- Multiple threads share the functional units
- The independent hardware state of each thread is duplicated
- Types of multithreading:
  - Fine-grained: switch between threads on each cycle
  - Coarse-grained: switch between threads only on costly stalls
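The two switching policies can be contrasted with a toy scheduler (a sketch only: the thread names and the stall cycle are hypothetical, not a real pipeline model):

```python
# Toy model contrasting fine-grained and coarse-grained multithreading:
# each cycle, the processor runs exactly one thread.

def fine_grained(threads, cycles):
    """Fine-grained: switch to the next thread every cycle (round-robin)."""
    return [threads[c % len(threads)] for c in range(cycles)]

def coarse_grained(threads, cycles, stall_cycles):
    """Coarse-grained: stay on one thread; switch only on a costly stall."""
    schedule, t = [], 0
    for c in range(cycles):
        schedule.append(threads[t])
        if c in stall_cycles:            # costly stall this cycle -> switch
            t = (t + 1) % len(threads)
    return schedule

print(fine_grained(["T0", "T1"], 6))         # ['T0', 'T1', 'T0', 'T1', 'T0', 'T1']
print(coarse_grained(["T0", "T1"], 6, {2}))  # ['T0', 'T0', 'T0', 'T1', 'T1', 'T1']
```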
Simultaneous Multithreading – the Idea
- Combine superscalar and multithreading
- From superscalar: issue multiple instructions per cycle
- From multithreading: hardware state for several programs/threads
- Result: issue multiple instructions from multiple threads in each cycle
Comparison
SMT Model
- Minimal extension of an out-of-order superscalar
- Resources replicated:
  - State for hardware contexts (registers, PCs)
  - Per-thread mechanisms for pipeline flushing and subroutine returns
  - Per-thread identifiers for the branch target buffer and the translation lookaside buffer
SMT Model (continued)
- Resources redesigned:
  - Instruction fetch unit
  - Processor pipeline
- Instruction scheduling does not require additional hardware
  - Register renaming works the same as in a superscalar
Block Diagram
Instruction Fetch Unit
- Takes advantage of inter-thread competition by:
  - Partitioning fetch bandwidth
  - Fetching the threads that give maximum local benefit
- 2.8 fetching:
  - Fetch up to 8 instructions from each of 2 threads per cycle
  - Take instructions from one thread until a branch or the end of the cache line, then switch to the other
- ICount feedback:
  - Highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages
  - Requires only a small hardware addition to track queue lengths
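The ICount heuristic reduces to picking the thread with the smallest in-flight count (a minimal sketch; the per-thread counts below are hypothetical):

```python
# Sketch of the ICount fetch policy: give fetch priority to the thread
# with the fewest instructions in the pre-issue pipeline stages
# (decode, rename, instruction queues).

def icount_pick(inflight):
    """inflight maps thread id -> instructions in decode/rename/queues."""
    return min(inflight, key=inflight.get)

# Thread 1 has the least in-flight work, so it gets fetch priority.
print(icount_pick({0: 12, 1: 3, 2: 7}))  # 1
```

Favouring the least-occupied thread keeps the instruction queues balanced, so no single thread can clog the shared pipeline.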
Register File and Pipeline
- Each thread has 32 architectural registers
- Register file size: 32 × number of threads, plus rename registers
- A larger register file means a longer access time
- To avoid increasing the clock cycle time, the SMT pipeline is extended to allow 2-cycle register reads and writes
- 2-cycle reads/writes increase the branch misprediction penalty
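The sizing arithmetic can be made concrete (the thread count and rename-pool size below are illustrative figures, not from the talk):

```python
# Physical register file size for an SMT design:
# 32 architectural registers per thread (from the slide), times the
# number of hardware contexts, plus a shared rename-register pool.

ARCH_REGS = 32     # architectural registers per thread
THREADS = 8        # hardware contexts (illustrative)
RENAME_REGS = 100  # shared rename registers (illustrative)

total = ARCH_REGS * THREADS + RENAME_REGS
print(total)  # 356 physical registers
```

A register file this large is why the access takes two cycles rather than one.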
Results
- ILP and TLP are exploited simultaneously
- SMT vs. superscalar: the superscalar is unable to exploit TLP
- SMT vs. fine-grained multithreading: fine-grained eliminates only vertical waste
- SMT vs. multiprocessors: multiprocessors are limited by static resource partitioning
- Hurrah! SMT performed the best.
SMT Issues – what to fetch
- Static:
  - Round-robin: 8 instructions from one thread, or 4 instructions from each of two threads, or 2 instructions from each of four threads, etc.
- Dynamic:
  - Favour threads with minimal in-flight branches
  - Favour threads with minimal outstanding misses
  - Favour threads with minimal in-flight instructions
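The static round-robin schemes all divide a fixed fetch bandwidth evenly, which can be sketched in a few lines (the 8-slot bandwidth matches the slide; the function name is my own):

```python
# Static fetch partitioning: split an 8-instruction fetch bandwidth
# evenly across the threads fetched this cycle (1, 2, or 4 threads).

def static_partition(bandwidth, threads):
    """Return {thread_id: fetch slots} for an even static split."""
    assert bandwidth % threads == 0, "bandwidth must divide evenly"
    return {t: bandwidth // threads for t in range(threads)}

print(static_partition(8, 1))  # {0: 8}
print(static_partition(8, 2))  # {0: 4, 1: 4}
print(static_partition(8, 4))  # {0: 2, 1: 2, 2: 2, 3: 2}
```

The dynamic policies keep the same 8-slot budget but redistribute it each cycle based on pipeline feedback.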
SMT Issues – what to issue
- Oldest instructions first
- Cache-hit-speculated instructions last
- Branch-speculated instructions first
- Branches first
Important result: unlike in a superscalar, the issue order doesn't matter much!
SMT Issues – Caching
- The same cache is shared among the threads
- No coherence issues
- But cache conflicts increase
- Possibility of cache thrashing
SMT Issues – Synchronization
- Spinlocks are not useful (in fact, bad!)
- The synchronization mechanism needs to be fast, lightweight, and scalable
- Suggested method (memory-based):
  - acquire(lock): blocks on failure; completes execution only on success
  - release(lock): writes zero if no other thread is blocked; otherwise unblocks the other thread
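The acquire/release semantics can be mimicked in software with a condition variable (a sketch only: the hardware scheme is memory-based and much cheaper; `BlockingLock` is my own illustrative name):

```python
import threading

# Software analogue of the blocking lock sketched above:
# acquire blocks until it succeeds instead of spinning;
# release "writes zero" and unblocks one waiting thread, if any.

class BlockingLock:
    def __init__(self):
        self._held = 0
        self._cv = threading.Condition()

    def acquire(self):
        with self._cv:
            while self._held:      # failure -> block, do not spin
                self._cv.wait()
            self._held = 1         # completes execution only on success

    def release(self):
        with self._cv:
            self._held = 0         # write zero
            self._cv.notify()      # unblock one waiting thread, if any

lock = BlockingLock()
lock.acquire()
lock.release()
print("ok")
```

The key property for SMT is that a blocked thread consumes no issue slots, whereas a spinning thread steals shared resources from the very thread holding the lock.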
SMT Issues – Compiler optimizations
- The compiler should try to minimize cache interference between multiple threads of the same program
- Latency-hiding techniques from uniprocessor environments, such as speculation, need to be rethought
- Sharing-optimization techniques from multiprocessors change, since data sharing is now beneficial
Applications
- Biggest application: servers!
- E.g., a server running Apache
- Used by Sun and IBM in high-end servers
Commercial SMTs
- Compaq Alpha 21464
  - Planned 4-thread processor
  - Axed in 2001
- Pentium 4 Xeon
  - 2-thread processor
  - "Hyper-Threading" is Intel's name for SMT
- Sun UltraSPARC IV
  - 2-thread processor
  - Also a CMP (chip multiprocessor)
Conclusion
- A simple design extension to existing processor technology
- Exploits ILP and TLP without sacrificing single-thread performance
- Optimized compiler and operating system support would improve performance further
Incidentally, Intel has announced plans for a multi-core SMT processor.
References
- Eggers, S., Emer, J., Levy, H., Lo, J., Stamm, R., & Tullsen, D. "Simultaneous Multithreading: A Platform for Next-Generation Processors."
- Lo, J., Eggers, S., Levy, H., Parekh, S., & Tullsen, D. "Tuning Compiler Optimizations for Simultaneous Multithreading."
- Tullsen, D., Lo, J., Eggers, S., & Levy, H. "Supporting Fine-Grain Synchronization on a Simultaneous Multithreaded Processor."
- Prof. Paolo Ienne, http://lapwww.epfl.ch/courses/advcomparch/
Questions?
Thank You!