
Sections: Motivation | Baseline decoupled look-ahead | Look-ahead thread acceleration | Experimental analysis | Summary

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Raj Parihar

Advisor: Prof. Michael C. Huang

March 22, 2013

Raj Parihar Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism


Motivation

Despite the proliferation of multi-core, multi-threaded systems:

- Single-thread performance is still an important design goal
- Modern programs do not lack instruction-level parallelism
- Real challenge: exploit implicit parallelism without undue costs
- One effective approach: decoupled look-ahead

[Figure: IPC (log scale, 1-100) for bzip2, crafty, eon, gap, gcc, gzip, mcf, pbmk, twolf, vortex, vpr, and their geometric mean, at instruction window sizes of 128, 512, and 2K]



Motivation: Decoupled Look-ahead

Decoupled look-ahead architecture targets:

- Performance hurdles: branch mispredictions, cache misses, etc.
- Exploration of parallelization opportunities and dependence information
- Microarchitectural complexity and energy inefficiency, through decoupling

The look-ahead thread can often become a new bottleneck. We explore techniques to accelerate the look-ahead thread:

- Speculative parallelization: aptly suited due to increased parallelism in the look-ahead binary
- Weak dependence: the lack of a correctness constraint allows weak instruction removal without affecting the quality of look-ahead



Outline

- Motivation
- Baseline decoupled look-ahead
- Look-ahead thread acceleration
  - Speculative parallelization in look-ahead
  - Weak dependence removal in look-ahead
- Experimental analysis
- Summary


Baseline Decoupled Look-ahead

- A binary parser generates the skeleton from the original program
- The skeleton runs on a separate core and:
  - Maintains its memory image in a local L1; no writeback to the shared L2
  - Sends branch outcomes through a FIFO queue; also helps prefetching

A. Garg and M. Huang, "A Performance-Correctness Explicitly Decoupled Architecture", MICRO-08

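The decoupling above can be sketched in code. This is a toy software model, not the hardware design: the class names, FIFO depth, and instruction encoding are illustrative assumptions.

```python
from collections import deque

# Toy model of baseline decoupled look-ahead: the look-ahead thread runs the
# skeleton with a private memory image (no writeback to shared L2) and pushes
# branch outcomes into a bounded FIFO that the main thread consumes as hints.

FIFO_DEPTH = 64  # assumed queue depth, for illustration only

class LookaheadCore:
    def __init__(self):
        self.local_l1 = {}                  # private memory image
        self.fifo = deque(maxlen=FIFO_DEPTH)

    def step(self, instr):
        pc, is_branch, value = instr
        if is_branch:
            self.fifo.append((pc, value))   # branch hint for the main thread
        else:
            self.local_l1[pc] = value       # stores touch only the local image

def main_thread_consume(core):
    # An empty queue means look-ahead has fallen behind; the main thread
    # then falls back on its own branch predictor (represented by None).
    return core.fifo.popleft() if core.fifo else None

skeleton = [(0x400, True, 1), (0x404, False, 0), (0x408, True, 0)]
core = LookaheadCore()
for instr in skeleton:
    core.step(instr)

hints = [main_thread_consume(core) for _ in range(3)]
print(hints)  # hints arrive in program order; None once the queue drains
```

Because the look-ahead thread never writes back to shared state, it can be wrong without corrupting the main thread, which is what makes the later optimizations safe.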

Practical Advantages of Decoupled Look-ahead

The look-ahead thread is a self-reliant agent, completely independent of the main thread:

- No need for quick spawning and register communication support
- Low management overhead on the main thread
- Easier for run-time control to disable

Further benefits:

- Natural throttling mechanism prevents run-away prefetching and cache pollution
- Look-ahead thread size is comparable to an aggregation of short helper threads


Look-ahead: A New Bottleneck

Comparing four systems to discover new bottlenecks: single-thread, decoupled look-ahead, ideal, and look-ahead alone. Application categories:

- Bottleneck removed, or the speed of look-ahead is not an issue
- Look-ahead thread is the new bottleneck (right half)

[Figure: IPC of single-thread, decoupled look-ahead, Ideal(Cache,Br), and look-ahead-only configurations across applications (apl, msa, wup, mgr, six, swim, fac, gal, gcc, gap, eon, fm3d, gzip, crft, vor, apsi, vpr, bzp2, eqk, amp, luc, art, pbmk, mcf, twlf)]


Unique Opportunities for Speculative Parallelization

Skeleton code offers more parallelism:

- Certain dependences are removed during slicing for the skeleton
- Short-distance dependence chains become long-distance chains, suitable for TLP exploitation

Look-ahead is inherently error-tolerant:

- Can ignore dependence violations
- Little to no support needed, unlike in conventional TLS


Software Support

Dependence analysis:

- Profile-guided, coarse-grained at the basic block level

Spawn and target points:

- Basic blocks with a consistent dependence distance of more than a threshold DMIN
- The spawned thread executes from the target point

Loop-level parallelism is also exploited.

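The spawn/target selection above can be sketched as a simple pass over a profiled basic-block trace. This is a hypothetical reconstruction of the idea, not the talk's actual tool: the function name, trace format, and dependence representation are assumptions.

```python
# Hypothetical sketch of profile-guided spawn/target selection: for each
# basic block, measure the distance (in blocks) to its nearest incoming
# dependence; blocks whose distance consistently exceeds DMIN are candidate
# target points, since a thread spawned DMIN blocks earlier can run to them
# without violating the dependence.

DMIN = 15  # threshold from the talk: 15 basic blocks

def find_target_points(trace, deps):
    """trace: basic-block ids in profiled execution order.
    deps: dict mapping a block id to the set of blocks it depends on."""
    last_seen = {}   # block id -> most recent index in the trace
    min_dist = {}    # block id -> smallest observed dependence distance
    for i, bb in enumerate(trace):
        for src in deps.get(bb, ()):
            if src in last_seen:
                d = i - last_seen[src]
                min_dist[bb] = min(min_dist.get(bb, d), d)
        last_seen[bb] = i
    # Blocks whose closest dependence is always farther than DMIN are safe
    # targets for a speculatively spawned look-ahead thread.
    return {bb for bb, d in min_dist.items() if d > DMIN}

# Block 5 depends on block 1, which executed 23 blocks earlier -> candidate.
# Block 3 depends on block 2, only 1 block earlier -> rejected.
trace = [1, 2, 3] + [4] * 20 + [5]
deps = {5: {1}, 3: {2}}
print(find_target_points(trace, deps))
```

A real pass would also track how consistent the distance is across many dynamic instances, as the slide's "consistent dependence distance" wording implies.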

Parallelism Potential in Look-ahead Binary

Available parallelism for a 2-core/context system, with DMIN = 15 basic blocks:

- The skeleton exhibits significantly more basic-block-level parallelism (17%)
- Loop-based FP applications exhibit more basic-block-level parallelism


Hardware and Runtime Support

Thread spawning and merging are very similar to regular thread spawning, except:

- The spawned thread shares the same register and memory state
- The spawning thread terminates at the target PC

Value communication:

- Register-based: naturally through shared registers in SMT
- Memory-based: can be supported at different levels; partial versioning in the cache at line granularity

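Line-level partial versioning can be illustrated with a small copy-on-write model. This is a sketch under assumptions (line size, class names, merge policy are made up), not the paper's cache design.

```python
# Illustrative model of partial versioning at cache-line granularity: stores
# from the spawned look-ahead thread create a private version of the line,
# leaving the spawning thread's state untouched; reads prefer the private copy.

LINE = 64  # assumed cache line size in bytes

class VersionedCache:
    def __init__(self, shared):
        self.shared = shared   # line_addr -> data, spawning thread's view
        self.private = {}      # line versions created by the spawned thread

    def read(self, addr):
        line = addr // LINE * LINE
        return self.private.get(line, self.shared.get(line))

    def write(self, addr, data):
        line = addr // LINE * LINE
        self.private[line] = data  # copy-on-write at line granularity

    def merge(self):
        # At thread merge the private versions become live state. Because
        # look-ahead tolerates dependence violations, no squash is needed.
        self.shared.update(self.private)
        self.private.clear()

shared = {0: "old"}
c = VersionedCache(shared)
c.write(8, "new")           # address 8 falls in the same 64-byte line as 0
assert c.read(0) == "new"   # spawned thread sees its own version
assert shared[0] == "old"   # spawning thread's state is undisturbed
c.merge()
assert shared[0] == "new"
```

The key property the model captures is that versioning is only "partial": correctness of the main thread never depends on it, which is why the backup slides can compare against alternatives with no versioning at all.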

Speedup of Speculative Parallelization

Results for the 14 applications in which the look-ahead thread is the bottleneck. Speedup of look-ahead systems over single-thread:

- Decoupled look-ahead over the single-thread baseline: 1.53x
- Speculative parallel look-ahead over single-thread: 1.73x
- Speculative look-ahead over decoupled look-ahead: 1.13x

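As a sanity check on these numbers, the 1.13x figure is simply the ratio of the two speedups measured against the common single-thread baseline:

```python
# The relative gain of speculative parallel look-ahead over the decoupled
# look-ahead baseline follows from the two absolute speedups reported above.

dla_over_single = 1.53    # decoupled look-ahead vs single-thread
spec_over_single = 1.73   # speculative parallel look-ahead vs single-thread

relative = spec_over_single / dla_over_single
print(round(relative, 2))  # 1.13, matching the reported number
```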

Speculative Look-ahead vs Conventional TLS

The skeleton provides more opportunities for parallelization:

- Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
- Speculative main thread over the single-thread baseline: 1.07x


Motivation for Exploiting Weak Dependences

Not all instructions are equally important and critical. Examples of weak instructions:

- Inconsequential adjustments
- Load and store instructions that are (mostly) silent
- Dynamic NOP instructions

Plenty of weak instructions are present in programs.

Challenges involved:

- Weakness is context-based; weak instructions are hard to identify and combine, much like Jenga

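One of the weak-instruction classes above, (mostly) silent stores, lends itself to a simple profiling sketch. The threshold, trace format, and function name here are illustrative assumptions, not the talk's methodology.

```python
from collections import defaultdict

# Illustrative profiler for (mostly) silent stores: a dynamic store is
# "silent" if it writes the value already in memory. Static stores that are
# silent in nearly all dynamic instances are candidates for removal from the
# look-ahead skeleton, since skipping them barely changes the memory image.

SILENCE_THRESHOLD = 0.95  # assumed cutoff, not a number from the talk

def profile_silent_stores(store_trace):
    """store_trace: list of (static_pc, address, value) dynamic store records."""
    memory = {}
    stats = defaultdict(lambda: [0, 0])  # static_pc -> [silent count, total]
    for pc, addr, value in store_trace:
        silent = memory.get(addr) == value
        stats[pc][0] += silent
        stats[pc][1] += 1
        memory[addr] = value
    return {pc for pc, (s, t) in stats.items() if s / t >= SILENCE_THRESHOLD}

# pc 0x10 rewrites the same value 19 of 20 times -> weak candidate.
# pc 0x20 writes a different value each time -> strong.
trace = [(0x10, 100, 7)] * 20 + [(0x20, 200, 1), (0x20, 200, 2)]
print(profile_silent_stores(trace))
```

The slide's point survives in the sketch: silence is a property of dynamic behavior, which is exactly why static heuristics struggle on the next slide.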

Comparison of Weak and Strong Instructions

Static attributes of weak and strong instructions are remarkably similar:

- Static attributes: opcode, number of inputs
- The correlation coefficient of the two distributions is 0.96

Weakness has very poor correlation with static attributes, so weak instructions are hard to identify through static heuristics.

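The 0.96 figure is a correlation between two attribute distributions, which can be sketched as follows. The opcode-mix numbers below are made up for illustration; only the method (Pearson correlation of the two distributions) reflects the slide.

```python
import math

# Correlating the static-attribute (here, opcode-mix) distribution of weak
# instructions with that of strong instructions. A coefficient near 1 means
# static attributes barely distinguish the two populations.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical fraction of instructions per opcode class:
# [add, load, store, branch, mul]
weak_mix   = [0.35, 0.30, 0.15, 0.15, 0.05]
strong_mix = [0.33, 0.28, 0.17, 0.16, 0.06]

r = pearson(weak_mix, strong_mix)
print(round(r, 2))  # close to 1: the mixes are nearly indistinguishable
```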

Genetic Algorithm based Framework

A genetic algorithm based framework identifies and eliminates weak instructions from the look-ahead skeleton:

- Genetic evolution: procreation and natural selection
- Chromosome creation and hybridization
- Baseline look-ahead skeleton construction

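The evolutionary loop can be sketched in a few lines. Everything concrete here is an assumption for illustration: a chromosome is a bit vector over skeleton instructions (0 = removed as weak), and the fitness function is a toy stand-in for the simulated look-ahead speed the real framework would measure.

```python
import random

# Toy GA over weak-instruction removal. In the real framework, fitness would
# come from simulating the pruned skeleton; here a synthetic "ground truth"
# set of weak instructions stands in so the demo is self-contained.

random.seed(42)
N_INSTRS, POP, GENS, MUT = 20, 12, 10, 0.05
TRULY_WEAK = set(range(0, N_INSTRS, 4))  # synthetic ground truth for the demo

def fitness(chrom):
    # Reward removing weak instructions, penalize removing strong ones
    # (removing strong ones would degrade look-ahead quality).
    removed = {i for i, bit in enumerate(chrom) if bit == 0}
    return len(removed & TRULY_WEAK) - 2 * len(removed - TRULY_WEAK)

def evolve():
    pop = [[random.randint(0, 1) for _ in range(N_INSTRS)] for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: POP // 2]                # natural selection + elitism
        children = []
        while len(children) < POP - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_INSTRS)  # single-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < MUT) for bit in child]  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

The backup slides' optimizations (sampling-based fitness, early termination, adaptive mutation rate) all target the expensive fitness evaluation this sketch trivializes.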

Heuristic Based Solutions

Heuristic based solutions are helpful to jump-start the evolution:

- Superposition based chromosomes
- Orthogonal subroutine based chromosomes


Progress of Genetic Evolution Process

Per-generation progress compared to the final best solution:

- After 2 generations, more than half of the benefits are achieved
- After 5 generations, at least 90% of the benefits are achieved


Experimental Setup

- Program/binary analysis tool: based on ALTO
- Simulator: based on heavily modified SimpleScalar
  - SMT, look-ahead, and speculative parallelization support
  - True execution-driven simulation (faithful value modeling)
- Genetic algorithm framework: modeled as an offline and online extension to the simulator
- Microarchitectural configuration: [Table: microarchitectural configuration]


Speedup of Self-tuned Look-ahead

Results for applications in which the look-ahead thread is a bottleneck. Self-tuned, genetic algorithm based decoupled look-ahead:

- Speedup over the baseline decoupled look-ahead: 1.16x
- Speedup over the single-thread baseline: 1.78x


Comparison with Speculative Parallel Look-ahead

- The self-tuned skeleton is used in the speculative parallel look-ahead
- In some cases, the self-tuned and speculative parallel look-ahead techniques are synergistic (ammp, art)


Ongoing and Future Explorations

- Load balancing through skipping non-critical branches
- Weak instruction classification and identification
- Single-core version of decoupled look-ahead
- Static heuristics to construct adaptive skeletons
- Role of look-ahead in promoting parallelization and accelerating the execution of interpreted programs


Summary

- Decoupled look-ahead can uncover significant implicit parallelism
- However, the look-ahead thread often becomes a new bottleneck
- Fortunately, look-ahead lends itself to various optimizations:
  - Speculative parallelization is more beneficial in the look-ahead thread
  - Weak instructions can be removed without affecting look-ahead quality
- Intelligent look-ahead is a promising approach in the era of flat frequency and modest microarchitectural scaling
- Idle cores in multicore environments further strengthen the case for decoupled look-ahead adoption in mainstream systems


Backup Slides


Microthreads vs Decoupled Look-ahead

[Figure: lightweight microthreads vs decoupled look-ahead, side by side]


Look-ahead Skeleton Construction


Performance Benefits of Decoupled Look-ahead


IPC of Speculative Parallelization


Speculative Parallelization: Cortex-A9 vs POWER5


Flexibility in Look-ahead Hardware Design

Comparison of regular (partial versioning) cache support with two other alternatives:

- No cache versioning support
- Dependence violation detection and squash


Partial Recoveries and Spawns

[Figures: partial recoveries; breakdown of all the spawns]


Genetic Algorithm Evolution


Multi-instruction Gene Examples


Optimizations to Implementation

Fitness test optimizations:

- Sampling based fitness
- Multi-instruction genes
- Early termination of tests

GA framework optimizations:

- Hybridization of solutions
- Adaptive mutation rate
- Unique chromosomes
- Fusion crossover operator
- Elitism policy


Superposition based Chromosomes


Recovery based Early Termination of Fitness Test


Weak Dependence: IPC and Instructions Removed

IPC comparison:

Instructions removed from baseline skeleton:


Sampling based Fitness Test


L2 Cache Sensitivity Study

Speedup for various L2 cache sizes is quite stable:

- 1.161x (1 MB), 1.154x (2 MB), and 1.152x (4 MB) L2 caches
- Average speedups, shown in the figure, are relative to single-threaded execution with a 1 MB L2 cache


Single-Gene vs Heuristic based Chromosomes


Genetic Algorithm based Heuristics


Other Details

- Energy reduction: 11% over baseline decoupled look-ahead (reduced cache accesses, less stalling of the main thread)
- On average, 10% of the dynamic instructions are removed from the baseline skeleton
- Offline profiling and control software overhead:
  - Offline profiling time: 2 to 20 seconds on the target machine
  - Online control software: 17 million instructions for the whole evolution
- Average extra recoveries: 3-4 per 100,000 instructions


Load Balancing in Look-ahead

Non-critical branches can be transformed to accelerate the look-ahead thread and achieve better load balancing:

- Non-critical branches are skipped in the look-ahead thread
- The main thread executes such branches by itself without any help
- We call these branches Do-It-Yourself (DIY) branches


Preliminary Speedup of Load Balancing

Load balancing through skipping of non-critical branches:

- Max speedup over decoupled look-ahead: 1.76x (art)
- Avg speedup over decoupled look-ahead: 1.12x (Gmean)
