Parallel Execution Models for Future Multicore Architectures
Guri Sohi, University of Wisconsin


Outline

• Retrospective
• The road ahead
• Review of existing parallel execution models
• New parallel execution models and opportunities
  – Program demultiplexing
  – Instrumented redundant multithreading

The Road Behind

• Hardware has continued to get faster
  – Mostly transparent to software
• Added software functionality directly impacts performance
  – A consequence of uniprocessor execution
  – Limits additional software functionality

The Secret of Hardware Success

• Transparency to higher-level software
• Very low-level parallel execution
• Appearance of sequential execution
  – Software written with a sequential assumption
    • Easier to express
    • Easier to get right

The Road Ahead: Part 1

• Multicore architectures
  – Likely low core complexity to conserve power
    • Limited exploitation of low-level parallelism
  – Will need to achieve concurrency
• Increasing hardware unreliability
  – Will likely need help from software to enhance system reliability
• Continuing software unreliability
  – Will likely result in additional (overhead) functionality in software

The Road Ahead: Part 2

• Lots of multimedia applications
  – Possibly amenable to traditional forms of concurrency
• Heavier use of modularity, encapsulation, information hiding, etc.
  – Amenable to traditional parallelization?
  – Benefit from or match different parallel execution models?
• Heavier use of dynamic actions/decisions

The Big Challenges

• Execution models to achieve execution concurrency on multicore architectures
  – Concurrent processing of core work
• Building reliable software/hardware systems from unreliable software and hardware components
  – Redundancy: additional overhead work
  – Redundancy as an opportunity for concurrent processing

Stepping Back

Given an ordered sequence of tasks:

• Process them in the given order: Sequential
• Try to come up with unordered sequences that accomplish the same result: Traditional Parallel
• Process in arbitrary order; give the appearance of processing in the given order: Proposed
  – Separate the processing from giving the appearance of order
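The third option can be illustrated with a minimal sketch (our own, not from the talk): tasks are processed in an arbitrary order into a private buffer, and the results are then committed in the original order, so an observer sees ordered processing. Function and variable names here are hypothetical.

```c
/* Sketch: two ways to handle an ordered sequence of independent tasks.
   Task i squares input[i]. */
#include <stddef.h>

#define NTASKS 4

static int do_task(int x) { return x * x; }   /* the "work" of one task */

/* Sequential: process tasks strictly in the given order. */
void run_sequential(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = do_task(in[i]);
}

/* Proposed style: results are produced in an arbitrary order into a
   buffer, then committed in the original order -- separating the
   processing from the appearance of order. */
void run_unordered_commit(const int *in, int *out, size_t n) {
    int buffer[NTASKS];
    for (size_t i = n; i-- > 0; )         /* arbitrary (here: reverse) order */
        buffer[i] = do_task(in[i]);
    for (size_t i = 0; i < n; i++)        /* commit in the given order */
        out[i] = buffer[i];
}
```

Both functions produce identical output; only the processing order differs, which is exactly the freedom the proposed model exploits.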

Traditional Parallel

• Hardware people build a multiprocessor
• Throw it to software people to use
• Come up with correct unordered sequences
  – This is very hard
• Use synchronization to ease reasoning
  – i.e., create order; restrict unorder
• Very difficult to parallelize transparently in the general case

Rethinking Traditional Parallelization

• Typically use speculation to alleviate ordering constraints
  – Speculative multithreading
  – Transactions

Speculative Multithreading

• Speculatively parallelize an application
  – Speculatively create unordered sequences from an ordered one
  – Use speculation to overcome ambiguous dependences
  – Use hardware support to recover from mis-speculation
  – E.g., Multiscalar
• Speculatively acquire a lock
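The recovery mechanism can be sketched in a few lines (our illustration, not Multiscalar itself): a later task runs speculatively against a predicted value, and is squashed and re-executed if the value actually produced by the earlier task differs, i.e., if an ambiguous dependence was resolved the wrong way. All names are hypothetical.

```c
/* Sketch of speculative execution with squash-and-re-execute recovery. */
#include <stdbool.h>

typedef struct {
    int predicted_x;   /* value the speculative task assumed      */
    int result;        /* result it computed from that assumption */
} spec_task;

static int later_task(int x) { return x + 10; }    /* the "later" task body */

/* Execute the later task speculatively, before the earlier task runs. */
static spec_task speculate(int predicted_x) {
    spec_task t = { predicted_x, later_task(predicted_x) };
    return t;
}

/* Commit: keep the speculative result only if the prediction held;
   otherwise squash and re-execute -- the hardware-recovery step. */
static int commit_or_squash(spec_task t, int actual_x, bool *squashed) {
    if (t.predicted_x == actual_x) {
        *squashed = false;
        return t.result;               /* speculation paid off */
    }
    *squashed = true;
    return later_task(actual_x);       /* recovery: re-execute with the real value */
}
```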

Transactions

• Simplify the expression of unordered sequences
• Very high overhead to implement semantics in software
• Hardware support for transactions will exist
  – Speculative multithreading is transactions with restrictions on ordering
  – No software overhead to implement semantics
  – More applications likely to be written with transactions
• Lots of similarities to speculative multithreading
  – Similar opportunities and limitations
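A minimal sketch of where the software overhead comes from, under the common optimistic-concurrency scheme (versioned reads validated at commit); the scheme and all names here are our own illustration, not a specific system from the talk.

```c
/* Sketch: software transactions via version validation.  Every read
   logs a version, and commit re-validates every logged version before
   making the write visible -- per-access bookkeeping that hardware
   transactions would do for free. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { int value; unsigned version; } tvar;   /* transactional variable */
typedef struct { tvar *var; unsigned seen; } read_entry;
typedef struct { read_entry log[8]; size_t nreads; } txn;

static int txn_read(txn *t, tvar *v) {                  /* log version on read */
    t->log[t->nreads].var  = v;
    t->log[t->nreads].seen = v->version;
    t->nreads++;
    return v->value;
}

/* Commit: the software overhead -- every read is re-checked. */
static bool txn_commit(txn *t, tvar *out, int new_value) {
    for (size_t i = 0; i < t->nreads; i++)
        if (t->log[i].var->version != t->log[i].seen)
            return false;                               /* conflict: abort */
    out->value = new_value;
    out->version++;                                     /* publish the write */
    return true;
}
```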

Control-Driven vs. Data-Driven Models

• Sequential execution is control-driven at the instruction level
  – An instruction is available to process (on an ALU) when control gets to it
• Traditional parallelization, speculative multithreading, transactions, etc., are also control-driven
  – Initiate execution of a task/transaction when program control reaches it
  – Concurrently executing entities can be ordered or unordered
  – Limits the ability to reach distant parallelism
• Can we have a usable data-driven parallel execution model?

Out-of-Order Superscalar

• Instructions fetched in control-driven (sequential) order
• Instructions executed in data-driven order
• Instructions committed in control-driven (sequential) order
• Low-level 2-4x parallel execution with a high-level sequential view
• Maintaining the high-level sequential view is critical to software and hardware development

Program Demultiplexing

• New opportunities for parallelism
  – High-level 2-4x parallelism
• A program is a multiplexing of methods (or functions) onto a single flow of control
  – Convenience of expression
  – Matched contemporary processing hardware
• De-multiplex the methods of a program
• Execute methods in parallel in dataflow fashion
• Give the appearance of ordered execution

Program Demultiplexing (PD)

[Figure: a sequential program's numbered methods (1-7) mapped onto concurrent PD execution]

• Program Demultiplexing
  – Programs: sequential
  – Execution: near-dataflow on methods
  – Nodes: methods on processors

Execution Framework

[Figure: sequential execution of a method M() alongside its call site, handler, trigger, and execution buffer (EB)]

• Means of reaching distant parallelism: the call site

Execution Framework

[Figure: the life of a demultiplexed execution. On the main processor, sequential execution reaches the call site of M, searches the execution buffer pool, and uses the result if a valid entry is found. On an auxiliary processor, the trigger begins execution of M: the handler sets up the execution and provides parameters, M executes speculatively, and the execution is saved in the buffer pool. Executions that violate data dependencies are invalidated; the figure also marks the overheads of demultiplexed execution.]

• Assumption: the model is used for compiled C programs

Example: ucxx2.c in 300.twolf

    wire_chg = cost - funccost;
    truth = acceptt(...);
    if (truth == 1) {
        . . .
        new_assgnto_old2( . . . );    (60 cycles)
        dbox_pos(atermptr);           (Trig 1)
        dbox_pos(btermptr);           (Trig 2)
    }

x86 slices that provide each call's parameter (40 cycles each):

    dbox_pos(atermptr):
        mov %eax,0xffffffd0(%ebp)
        mov 0xffffffd0(%ebp),%edx
        mov %edx,(%esp,1)

    dbox_pos(btermptr):
        mov %eax,0xffffffcc(%ebp)
        mov 0xffffffcc(%ebp),%ecx
        mov %ecx,(%esp,1)

Execution Framework

• Handler per call site of M
  – Separates the call site from the program
  – May have control flow
• Every call is a demultiplexed execution
• Trigger per call site
  – Usually fires when the method and handler are ready
  – Begins the demultiplexed execution(s)
• Unordered executions
  – Dataflow based

[Figure: sequential execution vs. PD execution of methods M() and P()]
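The interplay of handler, trigger, execution buffer, and call site can be sketched single-threaded (our own illustration of the mechanism; every name here is hypothetical): the demultiplexed execution runs the method early and parks the result in a buffer entry, and the original call site consumes the entry if it is still valid, falling back to a normal call otherwise.

```c
/* Sketch of the PD machinery around one call site of a method M. */
#include <stdbool.h>

static int calls_to_M = 0;                  /* counts actual executions of M */

static int M(int x) { calls_to_M++; return 2 * x + 1; }   /* the method */

typedef struct { bool valid; int param; int result; } eb_entry;

/* Handler + demultiplexed execution: conceptually fired by the trigger
   once the parameter's data dependencies are satisfied. */
static void demux_execute(eb_entry *eb, int param) {
    eb->param  = param;
    eb->result = M(param);    /* would run on an auxiliary processor */
    eb->valid  = true;
}

/* At the original call site: use the buffered execution if it is still
   valid for this parameter; otherwise execute M normally. */
static int call_site(eb_entry *eb, int param) {
    if (eb->valid && eb->param == param)
        return eb->result;    /* distant parallelism paid off */
    return M(param);          /* miss or invalidated: normal execution */
}
```

An intervening write that violates a data dependence would simply clear `valid`, which is the invalidation step from the framework figure.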

Methods

• Well encapsulated
  – Defined by parameters and return value
  – Stack for local computation
  – Heap for global state
• Often perform specific tasks
  – Access limited global state
• Now: don't care how the computation is implemented
• Proposed: don't care where, when(?), and how the computation is carried out

Handlers

• Task
  – Begin the demultiplexed execution(s) of a method
  – Provide parameters to the execution(s)
• Specifying a handler
  – Not explicitly specified in the program, but part of it
    • Evaluating compiled sequential programs
• Generate the handler from the program
  – A slice of instructions from the call site that provides the parameters

Triggers

• Indicates readiness of the method and its handler
  – Data dependencies satisfied
• Fires when the method and handler are ready
• Begins executing the handler

Demultiplexed Execution

• Demultiplexed execution
  – Immediately on an auxiliary processor
• Better scheduling possible
  – With extra bookkeeping
• Intelligent policies

[Figure: a main processor with an execution buffer (EB) and auxiliary processors P1-P3, each with a cache]

Reaching Distant Parallelism

    Benchmark   crafty  gap  gzip  mcf  parser  twolf  vortex  vpr
    > 1 (%)       60     72    30   80      70     40      63   47

[Figure: methods M1() and M2() with call sites A and B]

Reliable Systems

• Software is unreliable and error-prone
• Hardware will be unreliable and error-prone
• Improving hardware/software reliability will result in significant software redundancy
  – Redundancy will be a source of parallelism

Software Reliability & Security

• Reliability & security via dynamic monitoring
  – Many academic proposals for C/C++ code
    • CCured, Cyclone, SafeC, etc.
  – VM performs checks for Java/C#
• High overheads!
  – Encourages use of unsafe code
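The kind of check such tools insert can be sketched in a few lines; this helper is our illustration (not the API of CCured, Cyclone, or SafeC): every array access is guarded by a bounds check, and it is this per-access work that produces the high overheads the slide mentions.

```c
/* Sketch: a bounds-checked ("fat pointer" style) array read. */
#include <stdbool.h>
#include <stddef.h>

/* A "fat" view of an array: the data plus its bounds. */
typedef struct { const int *base; size_t len; } fat_array;

/* Instrumented read: refuse (instead of silently corrupting memory)
   when the index is out of bounds. */
static bool checked_read(fat_array a, size_t i, int *out) {
    if (i >= a.len)
        return false;      /* a real tool would report and abort here */
    *out = a.base[i];
    return true;
}
```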

Instrumented Redundant Multithreading

• Insert instrumentation/checking functionality (redundancy) into code without a commensurate performance impact
• Use parallelism to alleviate the performance impact
  – A non-traditional model for parallelism
• Successful parallelization will encourage more use of novel (overhead) functionality
• Likely crucial techniques for overall system (software AND hardware) reliability

Execution Model

• Divide the program into tasks
• Fork a monitor thread to check the computation of each task
• The monitor thread is instrumented with safety-checking code

[Figure: program tasks A-D, each shadowed by an instrumented monitor thread (A', B', C')]
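The model can be sketched with POSIX threads (a minimal single-task illustration, not the talk's implementation; the task and its checks are our stand-ins): the main thread runs the task at full speed while a monitor thread redundantly re-executes it with checking code, and the result is committed only if the monitor's checks pass and the two results agree.

```c
/* Sketch: one task paired with one instrumented monitor thread. */
#include <pthread.h>
#include <stdbool.h>

static int task(const int *a, int n) {          /* uninstrumented task */
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

typedef struct { const int *a; int n; int result; bool ok; } monitor_arg;

/* Instrumented re-execution: same computation plus safety checks. */
static void *monitor(void *p) {
    monitor_arg *m = p;
    int s = 0;
    m->ok = true;
    for (int i = 0; i < m->n; i++) {
        if (i >= m->n) { m->ok = false; break; }  /* illustrative bounds check */
        s += m->a[i];
    }
    m->result = s;
    return NULL;
}

/* Run the task and its monitor concurrently; commit only on agreement. */
static bool run_checked(const int *a, int n, int *out) {
    pthread_t tid;
    monitor_arg m = { a, n, 0, false };
    pthread_create(&tid, NULL, monitor, &m);
    int fast = task(a, n);                      /* main thread's execution */
    pthread_join(tid, NULL);
    *out = fast;
    return m.ok && m.result == fast;            /* checks passed, results agree */
}
```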

Task Commits & Aborts

• Commit/abort at task granularity
• Precise error detection achieved by re-executing code with inlined checks
  – Also provides precise exceptions

[Figure: tasks A-D with monitor threads A', B', C'; A' leads to a COMMIT, while re-execution B'' leads to an ABORT of task C]

IRMT Implementation

• Hardware support
  – Hardware thread contexts
  – Register checkpointing
  – Speculative buffering
  – PC translation
• Software support
  – Task selection
  – Instrumentation

Other Opportunities

• Spreading single-thread computation(s) across multiple processing cores
  – A special form of demultiplexing
  – Similar to cohort scheduling
    • Example: user and OS code, different interrupts, etc.
  – Significant instruction cache benefits
  – Significant branch predictor benefits
  – Potential data cache benefits
• Can this be done in a manner transparent to the OS?

Summary

• CMPs will both require and allow for innovative models of software concurrency
• Data-driven, method-level concurrency is a promising model
  – Likely a good match for anticipated programming styles
• Techniques for enhancing software and hardware reliability will afford new forms of concurrency
  – Now is the time to start thinking about future opportunities for concurrency