Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra

University of Edinburgh: http://homepages.inf.ed.ac.uk/mc/Projects/VESPA
University of Manchester: http://apt.cs.man.ac.uk/projects/iTLS

Page 1: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra

University of Edinburgh
http://homepages.inf.ed.ac.uk/mc/Projects/VESPA

University of Manchester
http://apt.cs.man.ac.uk/projects/iTLS

Page 2: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Intl. Symp. on Workload Characterization - December 2010 2

Introduction

Thermal/power constraints, complexity, and time-to-market pressures lead to CMPs

Many simple cores = high TLP but low ILP
– OK for throughput computing, server workloads, and embarrassingly parallel applications

Problem:
– No benefits for sequential applications
– Parallel applications with large sequential parts are still limited by Amdahl's law

=> Thread Level Speculation (TLS)
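The Amdahl limit named on this slide can be made concrete with a short calculation. This is a minimal sketch: the function name `amdahl_speedup` and the example fractions are illustrative assumptions, not numbers from the talk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the
    sequential run time can be spread over `cores` cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only; they are not measurements from the study.
    for fraction in (0.5, 0.9, 0.99):
        for cores in (4, 16, 64):
            bound = amdahl_speedup(fraction, cores)
            print(f"parallel={fraction:.2f} cores={cores:3d} bound={bound:.2f}")
```

With half of the program sequential, four cores give a bound of 1.6x, which is why shrinking the sequential part matters more than adding cores.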

Page 3: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Motivation

Shortcomings of prior work in assessing TLS performance potential

– Evaluations are often tied to a particular TLS architectural configuration

– Proposals of new extensions naturally focus on those extensions and do not investigate their interplay with other features

– Workload choice is often limited to one particular domain or programming style

Page 4: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Contributions

In-depth implementation-independent study of TLS performance potential

Evaluate TLS architectural features

Evaluate workloads from a variety of domains

Investigate load imbalance and coverage within the context of TLS

Page 5: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Outline

Introduction
Background
Methodology
Results
Conclusions

Page 6: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Thread Level Speculation

Compiler deals with:
– Task selection
– Code generation

HW deals with:
– Different context
– Spawn threads
– Detecting violations
– Replaying
– Arbitrate commit

[Figure: Thread 1 and Thread 2 on a timeline; label: Speculative; axis: Time]
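The hardware duties listed on this slide (separate contexts, violation detection, replay, in-order commit) can be illustrated with a toy software model. This is a sketch under stated assumptions, not the mechanism evaluated in the talk: real TLS hardware tracks accesses in the cache hierarchy, while here every name (`execute`, `run_tls`, the three example tasks) is hypothetical and memory is a plain dictionary.

```python
from typing import Callable, Dict, List, Set, Tuple

Memory = Dict[str, int]
Read = Callable[[str], int]
Write = Callable[[str, int], None]
Task = Callable[[Read, Write], None]


def execute(task: Task, visible: Memory) -> Tuple[Set[str], Memory]:
    """Run one task against `visible`, buffering its writes.
    Returns its exposed read set and its private write buffer."""
    read_set: Set[str] = set()
    write_buffer: Memory = {}

    def read(addr: str) -> int:
        if addr in write_buffer:        # value produced by the task itself
            return write_buffer[addr]
        read_set.add(addr)              # exposed read: may depend on earlier tasks
        return visible[addr]

    def write(addr: str, value: int) -> None:
        write_buffer[addr] = value      # kept private until commit

    task(read, write)
    return read_set, write_buffer


def run_tls(tasks: List[Task], memory: Memory) -> int:
    """Run all tasks speculatively from one snapshot, then commit in program
    order. Returns how many tasks were squashed and replayed."""
    snapshot = dict(memory)
    speculative = [execute(task, snapshot) for task in tasks]
    written: Set[str] = set()           # locations committed since the snapshot
    replays = 0
    for task, (read_set, write_buffer) in zip(tasks, speculative):
        if read_set & written:          # read-after-write violation detected
            read_set, write_buffer = execute(task, memory)  # replay on fresh data
            replays += 1
        memory.update(write_buffer)     # commit
        written.update(write_buffer)
    return replays


# Three hypothetical tasks, listed in sequential program order.
def produce_a(read: Read, write: Write) -> None:
    write("a", read("a") + 1)


def consume_a(read: Read, write: Write) -> None:
    write("b", read("a") * 10)          # depends on produce_a


def independent(read: Read, write: Write) -> None:
    write("c", read("c") + 5)


if __name__ == "__main__":
    memory = {"a": 1, "b": 0, "c": 0}
    print(run_tls([produce_a, consume_a, independent], memory), memory)
```

In this example only `consume_a` is replayed, and the final memory matches what sequential execution would produce.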

Page 7: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Architectural Extensions

Multiversioned caches

Support for out-of-order spawning

Dynamic dependence synchronization

Intermediate checkpointing

Data value prediction

Page 8: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Outline

Introduction
Background
Methodology
Results
Conclusions

Page 9: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Methodology

Benchmarks
– Imperative: SPEC CPU 2006, Mediabench II
– Object oriented: SPEC JVM 98, DaCapo

Instrumentation
– GCC4 pass
    Annotate loop iterations and method bodies
    Mark induction, reduction variables and use of return values
    Operate after the intermediate optimizations
– Jikes RVM modification

Page 10: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Methodology

Trace Generation
– Simics, full-system functional simulator
– Non-intrusive trace of memory accesses

Trace-Driven Simulation
– In-house simulator tool
    Extracts threads out of loop iterations and/or method call continuations
    Simulates: multi-versioned caches, OoO spawning, dynamic dependence synchronization, and value prediction
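To make the trace-driven step more tangible, the sketch below splits a memory-access trace into per-iteration tasks and lists cross-task read-after-write dependences. The trace record format (`iter_begin`, `iter_end`, `load`, `store`) and all names are assumptions made for illustration; the slides do not describe the format used by the in-house tool.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class TaskRecord:
    loads: Set[int] = field(default_factory=set)    # exposed loads only
    stores: Set[int] = field(default_factory=set)


def split_into_tasks(trace: Iterable[Tuple[str, int]]) -> List[TaskRecord]:
    """Cut the trace at iteration markers and collect each task's accesses."""
    tasks: List[TaskRecord] = []
    current = None
    for kind, value in trace:
        if kind == "iter_begin":
            current = TaskRecord()
            tasks.append(current)
        elif kind == "iter_end":
            current = None
        elif current is not None:
            if kind == "load" and value not in current.stores:
                current.loads.add(value)   # not satisfied by the task's own store
            elif kind == "store":
                current.stores.add(value)
    return tasks


def cross_task_raw(tasks: List[TaskRecord]) -> List[Tuple[int, int]]:
    """(producer, consumer) pairs where a later task loads what an earlier one stores."""
    pairs = []
    for consumer in range(len(tasks)):
        for producer in range(consumer):
            if tasks[consumer].loads & tasks[producer].stores:
                pairs.append((producer, consumer))
    return pairs


# Hypothetical trace of three loop iterations; the second field is a loop id
# for markers and an address for memory accesses.
EXAMPLE_TRACE = [
    ("iter_begin", 0), ("load", 0x10), ("store", 0x20), ("iter_end", 0),
    ("iter_begin", 0), ("load", 0x20), ("store", 0x30), ("iter_end", 0),
    ("iter_begin", 0), ("load", 0x40), ("store", 0x40), ("iter_end", 0),
]

if __name__ == "__main__":
    print(cross_task_raw(split_into_tasks(EXAMPLE_TRACE)))
```

Pairs reported by `cross_task_raw` are the dependences that would trigger a squash if the consumer ran before the producer committed.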

Page 11: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Methodology

Task Selection
– In-order loop-level speculation
    Innermost loops
    Best loops out of three dynamic depth levels
– In-order method and Out-of-Order speculation
    Dynamic thread spawning policy favoring safer threads
    Maximum thread size heuristic
– All loops and/or methods are candidates
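The "best loops out of three dynamic depth levels" policy can be pictured as a small selection step. The sketch below is hypothetical: it assumes depth is counted from the innermost loop and that "best" means the highest estimated speedup from a profiling run, neither of which is stated on the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LoopLevel:
    name: str
    depth: int            # dynamic nesting depth, 0 = innermost (assumption)
    est_speedup: float    # hypothetical profile-based estimate


def best_loop(levels: List[LoopLevel], depth_levels: int = 3) -> Optional[LoopLevel]:
    """Pick the level with the highest estimated speedup among the
    `depth_levels` innermost levels, or None if there is no candidate."""
    candidates = [lv for lv in levels if 0 <= lv.depth < depth_levels]
    return max(candidates, key=lambda lv: lv.est_speedup, default=None)


if __name__ == "__main__":
    nest = [
        LoopLevel("k", 0, 1.2),
        LoopLevel("j", 1, 2.5),
        LoopLevel("i", 2, 1.8),
        LoopLevel("h", 3, 9.0),   # outside the three considered levels
    ]
    print(best_loop(nest))
```

Here loop `j` is chosen even though `h` has a higher estimate, because `h` lies outside the three levels considered.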

Page 12: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Outline

Introduction
Background
Methodology
Results
Conclusions

Page 13: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Loop-level speculation - Innermost

[Figure: threads for Iter. 1, Iter. 2, ..., Iter. n; label: Speculative]

    for (i = 0; i < m; i++) {
        outer_loop_body1
        for (j = 0; j < l; j++) {
            inner_loop_body1
            for (k = 0; k < n; k++) {
                spawn_thread();
                innermost_loop_body
            }
            inner_loop_body2
        }
        outer_loop_body2
    }

Page 14: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Loop-level speculation - Innermost

Page 15: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Loop-level speculation – Best loop depth

[Figure: threads for Iter. 1, Iter. 2, ..., Iter. BD; label: Speculative]

    for (i = 0; i < m; i++) {
        outer_loop_body1
        for (j = 0; j < l; j++) {
            spawn_thread();
            inner_loop_body1
            for (k = 0; k < n; k++) {
                innermost_loop_body
            }
            inner_loop_body2
        }
        outer_loop_body2
    }

Page 16: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Loop-level speculation – Best loop depth

Page 17: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Method-level speculation - In-Order

[Figure: threads for method and method cont.; label: Speculative]

    pid = spawn_thread();
    if (pid != 0)
        method();
    method_cont
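The spawn pattern above runs a method and its continuation concurrently. A software analogy is sketched below, assuming the continuation is side-effect free and depends on the method only through its return value, which is guessed in advance. The function `speculate_continuation` and its parameters are hypothetical; a TLS system would perform the equivalent steps in hardware.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple, TypeVar

R = TypeVar("R")
C = TypeVar("C")


def speculate_continuation(
    method: Callable[[], R],
    continuation: Callable[[R], C],
    predicted_return: R,
) -> Tuple[C, bool]:
    """Run `method` and its continuation in parallel, feeding the continuation
    a predicted return value. Returns (result, replayed)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        method_future = pool.submit(method)                         # non-speculative thread
        cont_future = pool.submit(continuation, predicted_return)   # speculative thread
        actual = method_future.result()
        speculative_result = cont_future.result()
    if actual == predicted_return:
        return speculative_result, False    # prediction held: commit
    return continuation(actual), True       # misprediction: squash and replay


if __name__ == "__main__":
    def method() -> int:
        return 42

    def continuation(value: int) -> int:
        return value + 1

    print(speculate_continuation(method, continuation, 42))
    print(speculate_continuation(method, continuation, 0))
```

A correct prediction lets the continuation's work overlap with the method; a wrong one costs a replay, so the result is the same either way and only the timing differs.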

Page 18: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Method-level speculation - In-Order

Page 19: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Method-level speculation - OoO

[Figure: threads for method1, method2 cont. and method1 cont.; label: Speculative; axis: Time]

    pid = spawn_thread();
    if (pid != 0)
        method1();
    method1_cont

    method1() {
        method1_body1
        pid = spawn_thread();
        if (pid != 0)
            method2();
        method2_cont
    }

Page 20: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Method-level speculation - OoO

Page 21: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Mixed speculation - In-Order

Page 22: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Mixed speculation - OoO

Page 23: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Load Imbalance and Coverage

[Figure: chart with one entry per benchmark and speculation scheme. Axes: "Normalized over Amdahl's Law Speedup" (0 to 1) and "Percentage of Program Execution" (0% to 100%). Legend: Load Imbalance.]

Entries: gcc IO loop, gcc IO method, gcc mixed, lbm loop, lbm mixed, libq IO loop, libq IO method, libq IO mixed, mcf OoO loop, mcf mixed, sphinx3 IO loop, sphinx3 method, sphinx3 OoO mixed, cjpeg loop, cjpeg OoO method, jpg2Kd OoO loop, jpg2Kd OoO method, mpeg4d OoO loop, mpeg4d OoO method, compress OoO loop, compress mixed, pmd loop, pmd OoO method, pmd OoO mixed
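How load imbalance and coverage pull achieved speedup below the Amdahl bound can be shown with a small timing model. This is a sketch under simplifying assumptions that are not taken from the talk: no violations, no spawn or commit overhead, a core is released only when its iteration commits, and commits happen in program order. All function names are hypothetical.

```python
from typing import List, Sequence


def tls_loop_time(durations: Sequence[float], cores: int) -> float:
    """Completion time of a loop whose iterations run as in-order TLS threads."""
    if not durations or cores < 1:
        raise ValueError("need at least one iteration and one core")
    commit_times: List[float] = []
    for i, duration in enumerate(durations):
        # A core becomes available when iteration i - cores has committed.
        start = commit_times[i - cores] if i >= cores else 0.0
        finish = start + duration
        previous_commit = commit_times[i - 1] if i > 0 else 0.0
        commit_times.append(max(finish, previous_commit))   # in-order commit
    return commit_times[-1]


def region_speedup(durations: Sequence[float], cores: int) -> float:
    """Speedup inside the speculated region versus running it sequentially."""
    return sum(durations) / tls_loop_time(durations, cores)


def overall_speedup(coverage: float, speedup_in_region: float) -> float:
    """Whole-program speedup when `coverage` of the run time is speculated."""
    return 1.0 / ((1.0 - coverage) + coverage / speedup_in_region)


def normalized_over_amdahl(coverage: float, durations: Sequence[float], cores: int) -> float:
    """Achieved speedup divided by the bound for a perfectly balanced region."""
    achieved = overall_speedup(coverage, region_speedup(durations, cores))
    bound = overall_speedup(coverage, float(cores))
    return achieved / bound


if __name__ == "__main__":
    # Illustrative iteration lengths, not data from the study.
    for durations in ([2, 2, 2, 2], [4, 1, 1, 1]):
        ratio = normalized_over_amdahl(0.8, durations, 2)
        print(durations, f"{ratio:.3f}")
```

With two cores and 80% coverage, the balanced loop reaches the bound, while the loop with one long iteration reaches about 78% of it, because the short iterations wait for the long one to commit.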

Page 24: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Results – Multi-versioning to the rescue?

Page 25: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Outline

Introduction
Background
Methodology
Results
Conclusions

Page 26: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Conclusions

Load imbalance and limited coverage important factors in realizing TLS performance

Support for OoO spawning not providing significant benefits for the task policy employed

Multi-versioned caches unlock performance in some cases but are not a panacea

Task selection critical

Page 27: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Also in the paper

In-depth analysis of high coverage loops for selected benchmarks

Comparison of TLS loop-level speculation with a state-of-the-art auto-parallelizing compiler

OoO Loop-level speculation

Outline of most of the proposed architectural and compiler extensions for TLS systems

Page 28: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra

University of Edinburgh
http://homepages.inf.ed.ac.uk/mc/Projects/VESPA

University of Manchester
http://intranet.cs.man.ac.uk/apt/projects/iTLS

Page 29: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Backup slides – Auto parallelizing compiler comparison

Page 30: Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Backup slides – OoO loop