Thread-Level Speculation: Towards Ubiquitous Parallelism Greg Steffan School of Computer Science


Thread-Level Speculation:

Towards Ubiquitous Parallelism

Greg Steffan

School of Computer Science

Carnegie Mellon University


Moore’s Law: the Original Version

[Figure: log of transistors on a chip vs. time]

exponentially increasing resources


Moore’s Law: the Popular Interpretation

[Figure: log of performance vs. time]

increase resources → increase performance?


A Superposition of Innovations

[Figure: log of performance vs. time, with successive innovations: datapath size (8b, 16b, 32b, 64b), then Instruction-Level Parallelism (ILP)]

ILP is running out of steam


Why ILP is Running Out of Steam

Cross-chip wire latency (in cycles):

Development cost:

Power density:

Probability of a defect:

these problems must be addressed


How Do We Sustain the Performance Curve?

[Figure: log of performance vs. time: datapath size (8b, 16b, 32b, 64b), then Instruction-Level Parallelism (ILP), then "?"; the present is marked "we are here now"]

what is the next big win for micro-architecture?


A New Path: Thread-Level Parallelism

Tolerate cross-chip wire latency:
– localized wires

Lower development cost:
– stamp out processor cores

Lower power:
– turn off idle processors

Tolerate defects:
– disable any faulty processor

many advantages

[Figure: Chip Multiprocessor (CMP), showing processors (P) and caches (C)]


Multithreading in Every Scale of Machine

[Figure: threads at every scale of machine]
– Supercomputers
– Desktops
– Chip Multiprocessor (CMP): IBM Power4, SUN MAJC, Sibyte SB-1250
– Simultaneous Multithreading: ALPHA 21464, Intel Xeon

multithreading on a chip!


Improving Performance with a Chip Multiprocessor

Multiprogramming Workload:

[Figure: execution time of several applications, each running on its own processor (P) and caches (C)]

improves throughput


Improving Performance with a Chip Multiprocessor

Single Application:

[Figure: execution time of one application on a chip multiprocessor with processors (P) and caches (C)]

need parallel threads to reduce execution time


How Do We Parallelize Everything?

1) Programmers write parallel code from now on
– time-consuming and frustrating
– very hard to get right
– not a broad solution

2) System parallelizes automatically
– no burden on the programmer
– parallelize any application

automatic parallelization is preferred


Current Technique: Prove Independence

Independent:
for (i = 0; i < N; i++) A[i] = 0;
A[0] = 0, A[1] = 0, A[2] = 0

Dependent:
for (i = 1; i < N; i++) A[i] = A[i-1];
A[1] = A[0], A[2] = A[1], A[3] = A[2]

need to fully understand data access pattern
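
The difference between the two loops can be observed by running their iterations in a different order. The sketch below is only an illustration: the names `zero_body`, `shift_body`, and `order_insensitive` are hypothetical, and reversing the order is a stand-in for parallel execution. Agreement under one reordering is necessary for independence, not a proof of it.

```c
#include <assert.h>
#include <string.h>

#define N 8

/* Independent loop body: iteration i touches only A[i]. */
void zero_body(int *A, int i) { A[i] = 0; }

/* Dependent loop body: iteration i reads what iteration i-1 wrote. */
void shift_body(int *A, int i) { A[i] = A[i - 1]; }

/* Runs iterations first..N-1, in ascending order or reversed. */
static void run(void (*body)(int *, int), int *A, int first, int reversed) {
    for (int k = first; k < N; k++) {
        int i = reversed ? (N - 1) - (k - first) : k;
        body(A, i);
    }
}

/* Returns 1 if ascending and reversed execution produce the same array. */
int order_insensitive(void (*body)(int *, int), int first) {
    int a[N], b[N];
    assert(first >= 0 && first < N);
    for (int i = 0; i < N; i++) a[i] = b[i] = i + 1;
    run(body, a, first, 0);
    run(body, b, first, 1);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling `order_insensitive(zero_body, 0)` compares the two orders for the independent loop, and `order_insensitive(shift_body, 1)` does the same for the dependent one.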


Ubiquitous Parallelization: How Close Are We?

Compiler can parallelize portions of numeric programs
– scientific, floating-point, array-based codes
– usually written in Fortran

What about everything else?
– general-purpose, integer codes
– written in C, C++, Java, etc.
– little (if any) success so far

parallelize by proving independence

proving independence is infeasible


The Main Culprit: Indirection

Indirect array references:
for (i = 0; i < N; i++) A[i] = A[B[i]];

[Figure: iterations A[0] = A[B[0]], A[1] = A[B[1]], A[2] = A[B[2]], with "?" marking possible dependences between them]

need to know the values of B[]

Pointers:
while (...) {
    ... = *q;
    *p = ...;
}

[Figure: successive iterations "… *q *p …", with "?" marking a possible dependence between them]

need to know the targets of p and q
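
To see why the values of B[] matter, the sketch below counts, for the loop `A[i] = A[B[i]]`, how many iterations read an element that an earlier iteration wrote. The function name `count_raw_dependences` is hypothetical, and only read-after-write dependences are counted; the point is that the answer is a property of the run-time contents of B, which a compiler generally cannot see.

```c
#include <assert.h>
#include <stddef.h>

/* For the loop   for (i = 0; i < n; i++) A[i] = A[B[i]];
 * iteration i writes A[i] and reads A[B[i]]. The read depends on an
 * earlier iteration exactly when B[i] names an index below i. */
int count_raw_dependences(const int *B, int n) {
    int count = 0;
    assert(B != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        if (B[i] >= 0 && B[i] < i) {
            count++;   /* iteration B[i] produced the value read here */
        }
    }
    return count;
}
```

With B = {0, 1, 2, 3} every iteration reads its own element and the loop is free of such dependences; with B = {0, 0, 1, 2} iterations 1 to 3 each depend on an earlier one.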


Summary

We need the next big performance win
– instruction-level parallelism will run out of gas

Multithreading will soon be everywhere
– we need automatically-parallelized programs

The scope of current techniques is extremely limited
– proving independence is infeasible

A solution: Thread-Level Speculation (TLS)


Thread-Level Speculation: the Basic Idea

exploit available thread-level parallelism

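[Figure: execution time with TLS; a store through *p in one thread and a load through *q in a later thread cause a violation, and the later thread recovers]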


Outline

The Software/Hardware Sweet Spot

• Compiler Support

• Industry-Friendly Hardware

• Improving Value Communication

• Conclusions


Support for TLS: What Do We Need?

Break programs into speculative threads
– to maximize thread-level parallelism

Track data dependences
– to determine whether speculation was safe

Recover from failed speculation
– to ensure correct execution

three key elements of every TLS system
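
The three elements can be sketched in software for the loop `A[i] = A[B[i]]`. This is a sequential model under stated assumptions, not the hardware scheme described later in the talk: each iteration is treated as one speculative epoch, and the names `tls_run`, `tls_check`, and `MAX_N` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_N 64

/* Speculatively runs   for (i = 0; i < n; i++) A[i] = A[B[i]];
 * Returns the number of epochs that had to be re-executed. */
int tls_run(int *A, const int *B, int n) {
    int buffered[MAX_N];        /* one buffered speculative write per epoch */
    int committed[MAX_N] = {0}; /* committed[x] is set once A[x] is written */
    int violations = 0;
    assert(n >= 0 && n <= MAX_N);

    /* 1) Speculative threads: every epoch reads pre-loop memory and
     *    buffers its write, so A itself is not modified yet. */
    for (int i = 0; i < n; i++) {
        assert(B[i] >= 0 && B[i] < n);
        buffered[i] = A[B[i]];
    }

    /* Commit epochs in original program order. */
    for (int i = 0; i < n; i++) {
        /* 2) Track dependences: did an earlier epoch write what we read? */
        if (committed[B[i]]) {
            /* 3) Recover: re-execute the epoch with committed data. */
            buffered[i] = A[B[i]];
            violations++;
        }
        A[i] = buffered[i];
        committed[i] = 1;
    }
    return violations;
}

/* Returns the violation count if the speculative result matches the
 * sequential loop, or -1 if the two results differ. */
int tls_check(const int *init, const int *B, int n) {
    int seq[MAX_N], spec[MAX_N];
    assert(n >= 0 && n <= MAX_N);
    memcpy(seq, init, (size_t)n * sizeof seq[0]);
    memcpy(spec, init, (size_t)n * sizeof spec[0]);
    for (int i = 0; i < n; i++) seq[i] = seq[B[i]];
    int violations = tls_run(spec, B, n);
    return memcmp(seq, spec, (size_t)n * sizeof seq[0]) == 0 ? violations : -1;
}
```

When B carries no cross-iteration dependences, no epoch is re-executed; when it does, the result still matches sequential execution, at the cost of the re-executions.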


Compiler Researchers do it in Software


LRPD Test (Illinois at UC)

+ implemented entirely in software
– applies only to array-based code
– no partial parallelism

[Figure: execution time: parallel execution with software dependence tracking, followed by the test "was parallel execution safe?"]


Architects do it in Hardware


Multiscalar (Wisconsin)

• compiler breaks program into threads
• Address Resolution Buffer (ARB)

+/– highly specialized for speculation

[Figure: processors (P) connected to an ARB]


Our Approach: Find the Sweet Spot

Compiler:
+ global view of control flow
– hard/impossible to understand data dependences

Hardware:
– operates on a small window of instructions
+ observes dynamic memory accesses

leverage their respective strengths


The Sweet Spot

• Compiler:
– break programs into speculative threads
  • why: compiler has a global view of control flow

• Hardware:
– track data dependences
  • why: software comparison of all addresses infeasible
– recover from failed speculation
  • why: software buffering of all writes infeasible

important: minimize additional hardware


Outline

Compiler Support

• Industry-Friendly Hardware




Compiler Support for TLS

[Figure: compiler flow: Sequential Source Code → Region Selection (which loops? uses profile information) → Transformation and Optimization (inserts TLS instructions) → MIPS Executable]


Simple Performance Model

[Figure: four processors (P) with dependence tracking]

• 4 processors
• Each processor issues one instruction per cycle
• No communication latency between processors

shows potential performance benefit


Potential Improvement

significant impact on execution time


Outline

Industry-Friendly Hardware




Goals

1) Handle arbitrary memory accesses
– i.e. not just array references

2) Preserve single-thread performance
– keep hardware support minimal and simple

3) Apply to any scale of multithreaded architecture
– within a chip and beyond

effective, simple, scalable


Requirements

1) Recover from failed speculation
• buffer speculative writes from memory

2) Track data dependences
• detect data dependence violations

each has several implementation options


Recover From Failed Speculation: Option 1

Augment the store buffer:
+ common device in superscalar processors
• facilitates non-blocking stores
– too small

[Figure: processor with store buffer]

Add a new dedicated buffer
+ can design an efficient speculation mechanism
– want to avoid large speculation-specific structures

Augment the cache
+ very common structure
+ relatively large

[Figure: processor with cache]

just maintain single-thread performance



Tracking Data Dependences: Option 1

Add a dedicated “3rd-party” entity
– want to avoid large speculation-specific structures
– does not scale

[Figure: two processors (P) with caches (C) and a separate dependence tracker; it observes Load X and Store X and detects the violation]

Detection at the producer
• producer informed of all addresses consumed
– awkward: producer must notify consumer of any violation

[Figure: producer (Store X) and consumer (Load X); the load address is sent to the producer, where the violation is detected]

Detection at the consumer
• consumers informed of all addresses produced

[Figure: producer (Store X) and consumer (Load X); the store address is sent to the consumer, where the violation is detected]

similar to invalidation-based cache coherence!


Augmenting the Cache

[Figure: a processor (P) and its cache; each line holds Tag, State, and Data]

[Figure: the same cache with two extra bits per line: SL (Speculatively Loaded) and SM (Speculatively Modified)]

modest amount of extra space

Tag  SL  SM  State  Data
X    0   1   valid  #
V    0   1   valid  #
Y    0   0   valid  #
Z    0   1   valid  #

when speculation fails…

Tag  SL  SM  State    Data
-    0   0   invalid  -
-    0   0   invalid  -
Y    0   0   valid    #
-    0   0   invalid  -

…can quickly discard speculative state
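
A minimal software model of this discard step, assuming one SL bit and one SM bit per line as on the slide. The `CacheLine` layout and the names `squash` and `squash_example` are hypothetical, and the loop stands in for what hardware could do to all lines at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    char tag;    /* address tag, or '-' once the line is invalid */
    bool valid;
    bool sl;     /* speculatively loaded */
    bool sm;     /* speculatively modified */
} CacheLine;

/* Discards speculative state after failed speculation.
 * Returns the number of lines that were invalidated. */
int squash(CacheLine *lines, int n) {
    int invalidated = 0;
    assert(lines != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        if (lines[i].valid && lines[i].sm) {
            /* Speculatively written data must not become visible. */
            lines[i].valid = false;
            lines[i].tag = '-';
            invalidated++;
        }
        /* Every line returns to the non-speculative state. */
        lines[i].sl = false;
        lines[i].sm = false;
    }
    return invalidated;
}

/* Four valid lines, three of them speculatively modified.
 * Returns how many lines are still valid after the squash. */
int squash_example(void) {
    CacheLine cache[4] = {
        {'X', true, false, true},
        {'V', true, false, true},
        {'Y', true, false, false},
        {'Z', true, false, true},
    };
    int still_valid = 0;
    squash(cache, 4);
    for (int i = 0; i < 4; i++) still_valid += cache[i].valid;
    return still_valid;
}
```

Only the line that was never speculatively modified survives; the others are refetched from the non-speculative memory state on their next access.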


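Extending Cache Coherence

[Figure: two processors (P) with caches (C), running epochs 4 and 5. Epoch 5 performed Load X, so X is speculatively loaded in its cache. Epoch 4 performs Store X and sends "invalidate X; from 4"; the receiver detects a violation (4 < 5).]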

straightforward extension of cache coherence
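
The check made when an invalidation arrives can be written as one predicate. This is a sketch of the comparison shown on the slide (4 < 5); the function name `is_violation` is hypothetical, and epoch numbers are assumed not to wrap around.

```c
#include <assert.h>
#include <stdbool.h>

/* Decides whether an incoming invalidation signals a data dependence
 * violation at the receiving cache.
 *   line_sl:        the invalidated line is marked speculatively loaded
 *   sender_epoch:   epoch number carried by the invalidation message
 *   receiver_epoch: epoch currently running on the receiving processor */
bool is_violation(bool line_sl, unsigned sender_epoch, unsigned receiver_epoch) {
    /* A logically earlier epoch stored to data this epoch already read. */
    return line_sl && sender_epoch < receiver_epoch;
}
```

An invalidation from a logically later epoch, or one for a line that was never speculatively loaded, is handled as ordinary coherence traffic.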


Detailed Performance Model

Underlying architecture
– single-chip multiprocessor
– implements speculative coherence

Simulator
– superscalar, a modernized MIPS R10K
– models all bandwidth and contention

detailed simulation!

[Figure: processors (P) with caches (C) connected by a crossbar]


Will it Work at All of These Scales?

[Figure: threads at every scale of machine: Supercomputers, Desktops, Chip Multiprocessor (CMP), Simultaneous Multithreading]

yes: coherence scales up and down


Performance on Multi-Chip Systems

our scheme is scalable


Performance on General-Purpose Applications

significant performance improvements


Outline

Industry-Friendly Hardware

Improving Value Communication



Speculate

[Figure: Store *p and Load *q access memory speculatively]

good when p != q


Synchronize (and forward)

[Figure: Store *p and Load *q through memory, shown two ways: (Speculate), and synchronized, where the producer signals after Store *p and the consumer waits (stalls) before Load *q]

good when p == q


Reduce the Critical Forwarding Path

[Figure: panels "Overview", "Big Critical Path", and "Small Critical Path", showing Wait, Load X, Store X, and Signal, with the critical path, stall time, and execution time marked]

decreases execution time


Predict

[Figure: Store *p and Load *q through memory, shown three ways: (Speculate); (Synchronize), with Signal and Wait (stall); and Predict, with a value predictor]

good when p == q and *q is predictable


Improving on Compile-Time Decisions

[Figure: decision diagram in which the Compiler and the Hardware choose among Predict, Speculate, and Synchronize; "reduce critical forwarding path" appears on both the compiler and the hardware side]

improve the efficiency of value communication


Techniques

Prediction
– memory value prediction
– forwarded value prediction
– silent stores

Synchronization
– dynamic synchronization
– compiler scheduling to reduce the critical path
– hardware prioritization to reduce the critical path ($$$)

inexpensive, except for hardware prioritization


Execution Time Breakdown


Performance on 4 Processors

S=Sequential, B=Baseline

lots of failed speculation and synchronization


Performance on 4 Processors

S=Sequential, B=Baseline, O=Optimizations

significant improvement


Conclusions

• TLS may be the next big win

• Industry-friendly hardware is possible

• Efficient value communication is key

Ongoing/future work:
– compiler: improving region selection and coverage
– hardware: improve cache locality
