CILK: An Efficient Multithreaded Runtime System




People

Project at MIT & now at UT Austin

– Bobby Blumofe (now UT Austin, Akamai)

– Chris Joerg

– Brad Kuszmaul (now Yale)

– Charles Leiserson (MIT, Akamai)

– Keith Randall (Bell Labs)

– Yuli Zhou (Bell Labs)


Outline

– Introduction

– Programming environment

– The work-stealing thread scheduler

– Performance of applications

– Modeling performance

– Proven properties

– Conclusions


Introduction

Why multithreading? To implement dynamic, asynchronous, concurrent programs.

A Cilk programmer optimizes:

– total work

– critical path

A Cilk computation is viewed as a dynamic, directed acyclic graph (dag)


Introduction ...

Cilk program is a set of procedures

A procedure is a sequence of threads

Cilk threads are:

– represented by nodes in the dag

– non-blocking: each runs to completion with no waiting or suspension, so threads are atomic units of execution


Introduction ...

Threads can spawn child threads

– Downward edges connect a parent to its children.

A child & parent can run concurrently.

– Because threads are non-blocking, a child cannot return a value to its parent.

– Instead, the parent spawns a successor that receives values from its children.


Introduction ...

A thread & its successor are parts of the same Cilk procedure.

– They are connected by horizontal arcs.

A successor receives the values returned by the children before it begins:

– These values constitute data dependencies.

– They are drawn as curved arcs.


Introduction: Execution Time

Execution time of a Cilk program using P processors (TP) depends on:

– Work (T1): the time for the Cilk program to complete on 1 processor.

– Critical path (T∞): the time to execute the longest directed path in the dag.

These give two lower bounds:

– TP >= T1 / P (not true for some searches)

– TP >= T∞
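As a rough illustration of these two measures, the sketch below computes the work and the critical path of a small dag and the resulting lower bound on TP. The dag, its node costs, and the function names are made up for this example; they do not come from the Cilk system.

```python
from functools import lru_cache

# Hypothetical dag: thread name -> (cost, list of threads that depend on it).
DAG = {
    "a": (1, ["b", "c"]),
    "b": (3, ["d"]),
    "c": (1, ["d"]),
    "d": (1, []),
}

def work(dag):
    """T1: total cost of all threads, i.e. the 1-processor time."""
    return sum(cost for cost, _ in dag.values())

def critical_path(dag):
    """T_inf: cost of the most expensive directed path in the dag."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        cost, successors = dag[node]
        return cost + max((longest_from(s) for s in successors), default=0)
    return max(longest_from(node) for node in dag)

def lower_bound(dag, p):
    """No P-processor schedule can beat max(T1 / P, T_inf)."""
    return max(work(dag) / p, critical_path(dag))
```

For this dag the work is 6 and the critical path (a, b, d) is 5, so adding a second processor cannot bring the running time below 5.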


Introduction: Scheduling

Cilk uses a runtime scheduling technique called work stealing.

Works well on dynamic, asynchronous, MIMD-style programs.

For “fully strict” programs, Cilk achieves asymptotic optimality for space, time, & communication.


Introduction: language

Cilk is an extension of C

Cilk programs are:

– preprocessed to C

– linked with a runtime library


Programming Environment

Declaring a thread:

thread T ( <args> ) { <stmts> }

T is preprocessed into a C function of 1 argument and return type void.

The 1 argument is a pointer to a closure.


Environment: Closure

A closure is a data structure that has:

– a pointer to the C function for T

– a slot for each argument (inputs & continuations)

– a join counter: count of the missing argument values

A closure is ready when join counter == 0.

A closure is waiting otherwise.

Closures are allocated from a runtime heap.


Environment: Continuation

A Cilk continuation is a data type, denoted by the keyword cont.

cont int x;

It is a global reference to an empty slot of a closure.

It is implemented as 2 items:

– a pointer to the closure (which thread)

– an int value, the slot number (which input)
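A minimal sketch of how a closure and a continuation fit together, written in Python rather than Cilk's C. The names Closure, Continuation, MISSING, and fill are hypothetical and only mirror the fields described above; the real runtime lays these out as C structures.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

MISSING = object()  # sentinel marking an empty argument slot

@dataclass
class Closure:
    fn: Callable[..., None]  # stands in for the pointer to the thread's C function
    slots: List[Any]         # one slot per argument (inputs & continuations)
    join_counter: int = 0    # count of the missing argument values

    def __post_init__(self):
        # The join counter starts as the number of empty slots.
        self.join_counter = sum(1 for s in self.slots if s is MISSING)

    @property
    def ready(self) -> bool:
        # Ready when no argument is missing, waiting otherwise.
        return self.join_counter == 0

@dataclass
class Continuation:
    closure: Closure  # which thread
    slot: int         # which input

    def fill(self, value: Any) -> None:
        # Store a value into the referenced empty slot.
        assert self.closure.slots[self.slot] is MISSING
        self.closure.slots[self.slot] = value
        self.closure.join_counter -= 1
```

A closure built with two MISSING slots starts with a join counter of 2 and becomes ready once both continuations have been filled.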


Environment: spawn

To spawn a child, a thread creates the child’s closure:

spawn T ( <args> )

– creates the child’s closure

– sets the available arguments

– sets the join counter

To specify a missing argument, prefix with a “?”

spawn T (k, ?x);


Environment: spawn_next

A successor thread is spawned the same way as a child, except the keyword spawn_next is used:

spawn_next T (k, ?x)

Children typically have no missing arguments; successors do.


Explicit continuation passing

Because threads are nonblocking, a parent cannot block on its children’s results.

Instead, it spawns a successor thread.

This communication paradigm is called explicit continuation passing.

Cilk provides a primitive to send a value from one closure to another.


send_argument

Cilk provides the primitive

send_argument( k, value )

which sends value to the argument slot of a waiting closure specified by continuation k.

[Figure: a parent, its child, and its successor, with edges labeled spawn, spawn_next, and send_argument.]


Cilk procedure for computing a Fibonacci number

thread fib ( cont int k, int n ) {
    if ( n < 2 )
        send_argument( k, n );
    else {
        cont int x, y;
        spawn_next sum ( k, ?x, ?y );
        spawn fib ( x, n - 1 );
        spawn fib ( y, n - 2 );
    }
}

thread sum ( cont int k, int x, int y ) {
    send_argument( k, x + y );
}
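To make the control flow of the Fibonacci procedure concrete, here is a small single-processor simulation in Python. It is only a sketch under stated assumptions: spawn, spawn_next, and send_argument are Python stand-ins for the Cilk keywords, the closure layout is simplified, and the helpers cont, sum_, and run_fib are invented for this example. Each thread runs to completion and never waits; results travel only through send_argument.

```python
from collections import deque

MISSING = object()  # marks an argument slot that has not been filled yet
ready = deque()     # ready closures of the one simulated processor

class Closure:
    def __init__(self, fn, args):
        self.fn, self.args = fn, list(args)
        # Join counter: number of missing argument values.
        self.join = sum(a is MISSING for a in self.args)

def _make(fn, args):
    closure = Closure(fn, args)
    if closure.join == 0:       # nothing missing, so it is ready at once
        ready.append(closure)
    return closure

def spawn(fn, *args):           # stand-in for Cilk's spawn
    return _make(fn, args)

spawn_next = spawn  # same mechanics in this sketch; only the level differs in Cilk

def cont(closure, slot):        # continuation = (closure, slot number)
    return (closure, slot)

def send_argument(k, value):
    closure, slot = k
    closure.args[slot] = value
    closure.join -= 1
    if closure.join == 0:       # the waiting closure just became ready
        ready.append(closure)

def fib(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        s = spawn_next(sum_, k, MISSING, MISSING)  # successor waits for x and y
        spawn(fib, cont(s, 1), n - 1)
        spawn(fib, cont(s, 2), n - 2)

def sum_(k, x, y):
    send_argument(k, x + y)

def run_fib(n):
    result = [MISSING]
    def done(value):
        result[0] = value
    root = _make(done, [MISSING])  # waiting closure that receives the answer
    spawn(fib, cont(root, 0), n)
    while ready:                   # run ready threads, most recent first
        closure = ready.pop()
        closure.fn(*closure.args)
    return result[0]

print(run_fib(10))  # 55
```

Note that fib never reads x or y itself: the successor sum_ is the only thread that sees the children's values, which is what explicit continuation passing means here.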


Nonblocking Threads: Advantages

Shallow call stack (for us: fault tolerance).

Simplify runtime system:

Completed threads leave C runtime stack empty.

Portable runtime implementation


Nonblocking Threads: Disadvantages

Burdens the programmer with explicit continuation passing.


Work-Stealing Scheduler

The concept of work-stealing goes at least as far back as 1981.

Work-stealing:

– A processor with no work selects a victim from which to get work.

– It takes the shallowest thread in the victim’s spawn tree.

In Cilk, thieves choose victims randomly.


Thread Level


Stealing Work: The Ready Deque

Each closure has a level:

– level( child ) = level( parent ) + 1

– level( successor ) = level( parent )

Each processor maintains a ready deque:

– It contains ready closures.

– The Lth element contains the list of all ready closures whose level is L.


Ready deque

if ( ! readyDeque.isEmpty() )
    take deepest thread
else
    steal shallowest thread from readyDeque of randomly selected victim
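The scheduling rule above can be sketched in Python as follows. This is an illustration, not the Cilk runtime: the class ReadyDeque and the function next_closure are hypothetical names, and the order within a level (the owner takes the newest closure, a thief takes the oldest) is an assumption of this sketch.

```python
import random

class ReadyDeque:
    """Level-indexed pool of ready closures for one processor."""

    def __init__(self):
        self.levels = []  # levels[L] lists the ready closures whose level is L

    def push(self, closure, level):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append(closure)

    def is_empty(self):
        return not any(self.levels)

    def pop_deepest(self):
        # Local work: scan from the deepest level upward.
        for level in range(len(self.levels) - 1, -1, -1):
            if self.levels[level]:
                return self.levels[level].pop(), level
        return None

    def pop_shallowest(self):
        # Stolen work: scan from level 0 downward.
        for level, bucket in enumerate(self.levels):
            if bucket:
                return bucket.pop(0), level
        return None

def next_closure(me, deques, rng=random):
    """One scheduling step for processor `me`; returns (closure, level) or None."""
    if not deques[me].is_empty():
        return deques[me].pop_deepest()
    victims = [i for i in range(len(deques)) if i != me]
    if not victims:
        return None
    victim = rng.choice(victims)            # thieves choose victims randomly
    return deques[victim].pop_shallowest()  # None if the victim has no work
```

A failed steal simply returns None, and the caller would retry with a freshly chosen victim.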


Why Steal Shallowest closure?

Shallow threads probably produce more work and therefore reduce communication.

Shallow threads are more likely to be on the critical path.


Readying a Remote Closure

If a send_argument makes a remote closure ready, the closure is put on the sending processor’s readyDeque, which costs extra communication.

– This is done to make the scheduler provably good.

– Leaving the closure on the readyDeque of the processor where it already resides also works well in practice.


Performance of Applications

Tserial = time for C program

T1 = time for 1-processor Cilk program

Tserial /T1 = efficiency of the Cilk program

– Efficiency is close to 1 for programs with moderately long threads: Cilk overhead is small.


Performance of Applications

T1 / TP = speedup

T1 / T∞ = average parallelism

If average parallelism is large (relative to P), then speedup is nearly perfect.

If average parallelism is small, then speedup is much smaller.


Performance Data


Performance of Applications

Application speedup = efficiency × speedup

= ( Tserial / T1 ) × ( T1 / TP ) = Tserial / TP
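A short worked example of this identity, using made-up timings (9.0 s for the serial C program, 10.0 s for the 1-processor Cilk run, 1.25 s on 8 processors). The numbers and function names are illustrative only.

```python
def efficiency(t_serial, t1):
    """Tserial / T1: how much the Cilk overhead costs on one processor."""
    return t_serial / t1

def speedup(t1, tp):
    """T1 / TP: gain of the P-processor run over the 1-processor Cilk run."""
    return t1 / tp

def application_speedup(t_serial, t1, tp):
    """Efficiency times speedup, which simplifies to Tserial / TP."""
    return efficiency(t_serial, t1) * speedup(t1, tp)

# Hypothetical measurements, in seconds.
T_SERIAL, T1, TP = 9.0, 10.0, 1.25
print(efficiency(T_SERIAL, T1), speedup(T1, TP), application_speedup(T_SERIAL, T1, TP))
```

With these timings the efficiency is 0.9 and the speedup is 8, so the application runs about 7.2 times faster than the serial C program.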


Modeling Performance

TP >= max( T∞, T1 / P )

A good scheduler should come close to these lower bounds.


Modeling Performance

Empirical data suggest that for Cilk:

TP ≈ c1 T1 / P + c∞ T∞,

where c1 ≈ 1.067 & c∞ ≈ 1.042.

If T1 / T∞ > 10P, then the critical path has almost no effect on TP.
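The model can be tried out numerically. The sketch below plugs the two constants quoted on this slide into the formula and reports how much of the predicted time comes from the critical-path term; the function names and the sample values of T1, T∞, and P are assumptions for illustration.

```python
C1, C_INF = 1.067, 1.042  # empirical constants quoted on the slide

def predicted_tp(t1, t_inf, p):
    """Model: TP is roughly c1 * T1 / P + c_inf * T_inf."""
    return C1 * t1 / p + C_INF * t_inf

def lower_bound(t1, t_inf, p):
    """TP can never be below max(T_inf, T1 / P)."""
    return max(t_inf, t1 / p)

def critical_path_share(t1, t_inf, p):
    """Fraction of the predicted TP contributed by the critical-path term."""
    return C_INF * t_inf / predicted_tp(t1, t_inf, p)

# Hypothetical program: average parallelism 1000 / 2 = 500, which exceeds 10 * 32.
t1, t_inf, p = 1000.0, 2.0, 32
print(predicted_tp(t1, t_inf, p), lower_bound(t1, t_inf, p))
print(critical_path_share(t1, t_inf, p))
```

Whenever the average parallelism exceeds 10P, the critical-path term is less than a tenth of the work term in this model, which is why the prediction then tracks T1 / P closely.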


Proven Property: Time

Time: Including overhead,

TP = O( T1 / P + T∞ ),

which is asymptotically optimal.


Conclusions

We can predict the performance of a Cilk program by observing machine-independent characteristics:

– Work

– Critical path

when the program is fully strict.

Cilk’s usefulness is unclear for other kinds of programs (e.g., iterative programs).


Conclusions ...

Explicit continuation passing is a nuisance.

It was subsequently removed (with more clever pre-processing).


Conclusions ...

Great system research has a theoretical underpinning.

Such research identifies important properties

– of the systems themselves, or

– of our ability to reason about them formally.

Cilk identified 3 significant system properties:

– Fully strict programs

– Non-blocking threads

– Randomly choosing a victim


END


The Cost of Spawns

A spawn is about an order of magnitude more costly than a C function call.

Spawned threads running on the parent’s processor can be implemented more efficiently than remote spawns.

– Running on the parent’s processor is the usual case.

Compiler techniques can exploit this distinction.


Communication Efficiency

A request is an attempt to steal work (the victim may not have work).

Requests/processor & steals/processor both grow as the critical path grows.


Proven Properties: Space

In a fully strict program, each thread sends arguments only to its parent’s successors.

For such programs, space, time, & communication bounds are proven.

Space: SP <= S1 P.

– There exists a P-processor execution for which this is asymptotically optimal.


Proven Properties: Communication

Communication: The expected # of bits communicated in a P-processor execution is:

O( T∞ P SMAX )

where SMAX denotes the size of the largest closure.

There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where k > c T∞ P SMAX, for some constant c.