Combining Thread Level Speculation, Helper Threads, and Runahead Execution
Polychronis Xekalakis, Nikolas Ioannou and Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA
ICS 2009 2
Introduction
Single-core, out-of-order processors don't scale
– Simpler solution: multi-core architectures
No speedup for single-threaded applications
– Use Thread Level Speculation to extract TLP
– Use Helper Threads or RunAhead to improve ILP
However, for different apps (or phases), some models work better than others
Our proposal:
– Combine these execution models
– Decide at runtime when to employ them
Contributions
Introduce mixed Speculative Multithreading (SM) Execution Models
Design one that combines TLS, HT and RA
Propose a performance model able to quantify ILP and TLP benefits
Unified approach outperforms state-of-the-art SM models:
– TLS by 10.2% avg. (up to 41.2%)
– RA by 18.3% avg. (up to 35.2%)
Outline
Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions
Helper Threads
Compiler deals with:
– Memory ops that miss and hard-to-predict branches
– Backward slices
HW deals with:
– Spawn threads
– Different context
– Discard when finished
Benefit:
– ILP (Prefetch/Warmup)
RunAhead Execution
Compiler deals with:
– Nothing
HW deals with:
– Different context
– When to do RA
– VP Memory
– Commit/Discard
Benefit:
– ILP (Prefetch/Warmup)
Thread Level Speculation
Compiler deals with:
– Task selection
– Code generation
HW deals with:
– Different context
– Spawn threads
– Detecting violations
– Replaying
– Arbitrate commit
Benefit: TLP/ILP
– TLP (Overlapped Execution) + ILP (Prefetching)
Understanding Performance Benefits
Complex TLS thread interactions obscure performance benefits
Even more true for mixed execution models
We need a way to quantify ILP and TLP contributions to bottom-line performance
Proposed model:
– Able to break benefits down into ILP/TLP contributions
Performance Model
Sall = Sseq x Silp x Sovl
1. Compute overall speedup (Sall): Tseq/Tmt
2. Compute sequential TLS speedup (Sseq): Tseq/T1p
3. Compute speedup due to ILP (Silp): (T1+T2)/(T1'+T2')
4. Use everything to compute TLP (Sovl): Sall/(Sseq x Silp)
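The four steps of the performance model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tooling: the function and field names are hypothetical, and it assumes T1, T2 are per-thread execution times in the single-processor TLS run and T1', T2' are the times of the same threads in the multithreaded run, which the slides do not spell out. The input numbers are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SpeedupBreakdown:
    s_all: float  # overall speedup, Tseq/Tmt
    s_seq: float  # sequential TLS speedup, Tseq/T1p
    s_ilp: float  # speedup attributed to ILP, (T1+T2)/(T1'+T2')
    s_ovl: float  # speedup attributed to TLP, Sall/(Sseq x Silp)


def decompose_speedup(
    t_seq: float,                     # sequential execution time (Tseq)
    t_mt: float,                      # multithreaded execution time (Tmt)
    t_1p: float,                      # TLS code on a single processor (T1p)
    thread_times: Sequence[float],    # assumed meaning: T1, T2, ...
    thread_times_mt: Sequence[float], # assumed meaning: T1', T2', ...
) -> SpeedupBreakdown:
    s_all = t_seq / t_mt
    s_seq = t_seq / t_1p
    s_ilp = sum(thread_times) / sum(thread_times_mt)
    # Whatever is not explained by Sseq and Silp is attributed to overlap.
    s_ovl = s_all / (s_seq * s_ilp)
    return SpeedupBreakdown(s_all, s_seq, s_ilp, s_ovl)


# Made-up timings, for illustration only.
example = decompose_speedup(
    t_seq=120.0,
    t_mt=60.0,
    t_1p=120.0,
    thread_times=[60.0, 60.0],
    thread_times_mt=[50.0, 50.0],
)
print(example)
```

By construction the three factors multiply back to the overall speedup, so Sovl acts as the residual once the sequential and ILP terms have been measured.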
Unified Execution Model
Can we improve TLS?
1. Some of the threads do not help
2. Slack in usage of cores
Improve TLP:
– Requires a better compiler
Improve ILP:
– Combine TLS with another SM model!
– Most of the HW is common
Combining TLS, HT and RA
Start with TLS
Provide support to clone TLS threads and convert them to HT
Conversion to HT means:
– Put them in RA mode
– Suppress squashes and do not cause additional squashes
– Discard them when they finish
No compiler slicing: purely HW approach
Intricacies to be Handled
HT may not prefetch effectively!
Dealing with contention
– HT threads are much faster and can saturate BW
Dealing with thread ordering
– TLS imposes total thread order
– Killing an HT squashes TLS threads
Creating and Terminating HT
Create an HT on an L2 miss that we can value-predict (VP)
– Use mem. address based confidence estimator
– VP only if confident
Create an HT if we have a free processor
Only allow the most speculative thread to clone
– Seamless integration of HT with TLS
– BUT: if the parent is no longer the most spec. TLS thread, the HT has to be killed
Additionally kill HT when:
– Parent/HT thread finishes
– HT causes exception
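The creation and termination rules on this slide can be read as two predicates. The sketch below is a software paraphrase for illustration only: the real mechanism is in hardware, and every name here (`TLSThread`, `HelperThread`, `should_spawn_helper`, `should_kill_helper`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TLSThread:
    is_most_speculative: bool  # position in the TLS total thread order
    finished: bool = False


@dataclass
class HelperThread:
    finished: bool = False
    raised_exception: bool = False


def should_spawn_helper(
    l2_miss: bool,         # the parent just missed in the L2
    vp_confident: bool,    # confidence estimator says the value can be predicted
    free_processor: bool,  # an idle core is available
    parent: TLSThread,
) -> bool:
    # Only the most speculative TLS thread is allowed to clone itself.
    return l2_miss and vp_confident and free_processor and parent.is_most_speculative


def should_kill_helper(parent: TLSThread, helper: HelperThread) -> bool:
    return (
        not parent.is_most_speculative  # parent lost its place in the order
        or parent.finished
        or helper.finished
        or helper.raised_exception
    )


parent = TLSThread(is_most_speculative=True)
print(should_spawn_helper(True, True, True, parent))  # all conditions hold
parent.is_most_speculative = False                    # a newer TLS thread was spawned
print(should_kill_helper(parent, HelperThread()))
```

Restricting cloning to the most speculative thread keeps the helper at the tail of the TLS order, which is why losing that position becomes a kill condition.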
Experimental Setup
Simulator, Compiler and Benchmarks:
– SESC (http://sesc.sourceforge.net/)
– POSH (Liu et al. PPoPP '06)
– SPEC 2000 Int.
Architecture:
– Four-way CMP, 4-issue cores
– 16KB L1 data (multi-versioned) and instruction caches
– 1MB unified L2 caches
– Inst. window/ROB: 80/104 entries
– 16KB last value predictor
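For readers unfamiliar with last value prediction, the sketch below shows the general idea together with an address-indexed confidence counter of the kind the deck mentions. It is a generic textbook-style illustration, not the evaluated hardware: the table size, counter width, threshold, and indexing function are all assumptions, and the class name is hypothetical.

```python
class LastValuePredictor:
    """Untagged, address-indexed last value predictor with saturating confidence."""

    def __init__(self, entries: int = 1024, max_conf: int = 3, threshold: int = 2):
        # All three parameters are assumed values chosen for illustration.
        self.entries = entries
        self.max_conf = max_conf
        self.threshold = threshold
        self.values = [None] * entries
        self.conf = [0] * entries

    def _index(self, addr: int) -> int:
        # Assumes 8-byte aligned loads; no tags, so different addresses may alias.
        return (addr >> 3) % self.entries

    def predict(self, addr: int):
        """Return the predicted value, or None when not confident enough."""
        i = self._index(addr)
        if self.values[i] is not None and self.conf[i] >= self.threshold:
            return self.values[i]
        return None

    def update(self, addr: int, actual: int) -> None:
        """Train with the value the load actually returned."""
        i = self._index(addr)
        if self.values[i] == actual:
            self.conf[i] = min(self.conf[i] + 1, self.max_conf)
        else:
            self.values[i] = actual
            self.conf[i] = 0


lvp = LastValuePredictor()
for _ in range(3):
    lvp.update(0x40, 7)   # the same value is seen three times in a row
print(lvp.predict(0x40))  # confident now, so the last value is returned
lvp.update(0x40, 9)       # a different value resets confidence
print(lvp.predict(0x40))
```

Gating prediction on confidence mirrors the "VP only if confident" rule: a helper thread started from a wrong value would prefetch along a useless path.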
Comparing TLS, RunAhead and Unified Scheme
Almost additive benefits
10.2% over TLS, 18.3% over RA
Understanding the extra ILP
Improvements of ILP come from:
– Mainly memory
– Branch prediction (improvement 0.5%)
Focus on memory:
– Miss rate on committed path
– Clustering of misses (different cost)
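To make the isolated versus clustered distinction concrete, here is a small sketch that classifies misses from a list of cycle timestamps. The slides do not say how misses were classified, so the rule used here is an assumption: a miss counts as clustered when another miss falls within a fixed window of cycles, roughly the span in which their latencies could overlap. The function name and the window value are hypothetical.

```python
from typing import Sequence, Tuple


def classify_misses(miss_cycles: Sequence[int], window: int) -> Tuple[int, int]:
    """Return (isolated, clustered) counts for the given miss timestamps."""
    cycles = sorted(miss_cycles)
    isolated = clustered = 0
    for i, c in enumerate(cycles):
        near_prev = i > 0 and c - cycles[i - 1] <= window
        near_next = i + 1 < len(cycles) and cycles[i + 1] - c <= window
        if near_prev or near_next:
            clustered += 1  # latency can overlap with a neighbouring miss
        else:
            isolated += 1   # pays the full miss latency on its own
    return isolated, clustered


# Made-up trace: two clusters and one lone miss at cycle 500.
print(classify_misses([0, 10, 500, 1000, 1020, 1030], window=50))
```

Under this rule a scheme can leave the total miss count unchanged and still help, simply by moving misses from the isolated bucket into the clustered one.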
Normalized Shared Cache Misses
All schemes better than sequential
Unified 41% better than sequential
Isolated vs. Clustered Misses
Both TLS and RA behave like large window machines
Unified does even better
Also in the paper …
Dealing with the load of the system
Converting TLS threads to HT
Multiple HT
Effect of a better VP
Detailed comparison of performance model against existing models (Renau et al. ICS '05)
Conclusions
CMPs are here to stay:
– What about single-threaded apps and apps with significant seq. sections?
Different apps require different SM techniques
– Even within apps, different phases
We propose the first mixed execution model
– TLS is nicely complemented by HT and RA
Our unified scheme outperforms existing SM models
– TLS by 10.2% avg. (up to 41.2%)
– RA by 18.3% avg. (up to 35.2%)