CILK: An Efficient Multithreaded Runtime System
People
Project at MIT & now at UT Austin
– Bobby Blumofe (now UT Austin, Akamai)
– Chris Joerg
– Brad Kuszmaul (now Yale)
– Charles Leiserson (MIT, Akamai)
– Keith Randall (Bell Labs)
– Yuli Zhou (Bell Labs)
Outline
– Introduction
– Programming environment
– The work-stealing thread scheduler
– Performance of applications
– Modeling performance
– Proven properties
– Conclusions
Introduction
Why multithreading?
To implement dynamic, asynchronous, concurrent programs.
A Cilk programmer optimizes:
– total work
– critical path
A Cilk computation is viewed as a dynamic, directed acyclic graph (dag)
Introduction ...
A Cilk program is a set of procedures
A procedure is a sequence of threads
Cilk threads are:
– represented by nodes in the dag
– Non-blocking: each thread runs to completion without waiting or suspending, so threads are atomic units of execution
Introduction ...
Threads can spawn child threads
– downward edges connect a parent to its
children
A child & parent can run concurrently.
– Because threads are non-blocking, a child cannot return a
value to its parent.
– The parent spawns a successor that receives
values from its children
Introduction ...
A thread & its successor are parts of the same Cilk procedure.
– connected by horizontal arcs
Children’s returned values are received before their successor begins:
– They constitute data dependencies.
– Connected by curved arcs
Introduction: Execution Time
Execution time of a Cilk program using P processors depends on:
– Work (T1): the time for the Cilk program to complete on 1 processor.
– Critical path (T∞): the time to execute the longest directed path in the dag.
With TP denoting the execution time on P processors:
– TP >= T1 / P (not true for some searches)
– TP >= T∞
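As a rough illustration of these two measures, the sketch below computes the work and the critical path of a small dag and the resulting lower bound on TP. The dag, its thread names, and the per-thread costs are hypothetical and chosen only for the example.

```python
from functools import lru_cache

# Hypothetical dag: thread name -> (cost in time units, successor threads).
DAG = {
    "a": (1, ["b", "c"]),
    "b": (2, ["d"]),
    "c": (3, ["d"]),
    "d": (1, []),
}


def work(dag):
    """T1: total cost of all threads in the dag."""
    return sum(cost for cost, _ in dag.values())


def critical_path(dag):
    """T-infinity: cost of the most expensive directed path in the dag."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        cost, successors = dag[node]
        return cost + max((longest_from(s) for s in successors), default=0)

    return max(longest_from(node) for node in dag)


def lower_bound(dag, p):
    """max(T1 / P, T-infinity): no P-processor run of this dag can be faster."""
    return max(work(dag) / p, critical_path(dag))
```

For this dag the work is 7 and the critical path is 5 (a, c, d), so with 2 processors the bound is set by the critical path rather than by T1 / P.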
Introduction: Scheduling
Cilk schedules threads at run time using a technique called work stealing.
Works well on dynamic, asynchronous, MIMD-style programs.
For “fully strict” programs, Cilk achieves asymptotic optimality for space, time, & communication.
Introduction: language
Cilk is an extension of C
Cilk programs are:
– preprocessed to C
– linked with a runtime library
Programming Environment
Declaring a thread:
thread T ( <args> ) { <stmts> }
T is preprocessed into a C function of 1 argument and return type void.
That argument is a pointer to a closure.
Environment: Closure
A closure is a data structure that has:
– a pointer to the C function for T
– a slot for each argument (inputs & continuations)
– a join counter: count of the missing argument values
A closure is ready when join counter == 0.
A closure is waiting otherwise.
Closures are allocated from a runtime heap.
Environment: Continuation
A Cilk continuation is a data type, denoted by the keyword cont.
cont int x;
It is a global reference to an empty slot of a closure.
It is implemented as 2 items:
– a pointer to the closure (what thread)
– an int value: the slot number (what input)
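The closure and continuation described above can be modeled in a few lines. This is only a sketch in Python, not the Cilk runtime's C layout: the names Closure, Continuation, EMPTY, and fill are hypothetical, and a Python callable stands in for the pointer to the C function.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

EMPTY = object()  # sentinel for a slot whose value has not arrived yet


@dataclass
class Closure:
    fn: Callable[..., None]  # stands in for the pointer to the C function
    slots: List[Any]         # one slot per argument; EMPTY if still missing
    join_counter: int = field(init=False)

    def __post_init__(self) -> None:
        # The join counter is the number of missing argument values.
        self.join_counter = sum(1 for s in self.slots if s is EMPTY)

    @property
    def ready(self) -> bool:
        return self.join_counter == 0


@dataclass(frozen=True)
class Continuation:
    closure: Closure  # what thread
    slot: int         # what input


def fill(k: Continuation, value: Any) -> bool:
    """Store value in the slot named by k; return True if the closure is now ready."""
    target = k.closure
    if target.slots[k.slot] is not EMPTY:
        raise ValueError("slot already filled")
    target.slots[k.slot] = value
    target.join_counter -= 1
    return target.ready
```

A closure created with two empty slots starts out waiting with a join counter of 2 and becomes ready once both slots have been filled.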
Environment: spawn
To spawn a child, a thread creates its closure:
spawn T (<args>)
– creates child’s closure
– sets available arguments
– sets join counter
To specify a missing argument, prefix it with a “?”:
spawn T (k, ?x);
Environment: spawn_next
A successor thread is spawned the
same way as a child, except the
keyword spawn_next is used:
spawn_next T(k, ?x)
Children typically have no missing
arguments; successors do.
Explicit continuation passing
Because threads are nonblocking, a parent cannot block on its children’s results.
Instead, it spawns a successor thread.
This communication paradigm is called explicit continuation passing.
Cilk provides a primitive to send a value from one closure to another.
send_argument
Cilk provides the primitive
send_argument( k, value )
which sends value to the argument slot of a waiting closure specified by continuation k.
[Figure: parent, child, and successor threads linked by spawn, spawn_next, and send_argument.]
Cilk Procedure for computing a Fibonacci number
thread fib ( cont int k, int n ) {
    if ( n < 2 )
        send_argument( k, n );
    else {
        cont int x, y;
        spawn_next sum ( k, ?x, ?y );
        spawn fib ( x, n - 1 );
        spawn fib ( y, n - 2 );
    }
}
thread sum ( cont int k, int x, int y ) {
    send_argument ( k, x + y );
}
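To see the control flow of this procedure without a Cilk toolchain, here is a minimal single-processor emulation in Python. It is a sketch under several assumptions: the helper names (Closure, spawn, run_fib, sum_thread) are hypothetical, a continuation is modeled as a (closure, slot index) pair, spawn and spawn_next both map to the same helper, and thread levels and stealing are ignored.

```python
from collections import deque

EMPTY = object()  # marks a missing argument (written "?x" in Cilk)
ready = deque()   # ready closures of the one emulated processor


class Closure:
    def __init__(self, fn, args):
        self.fn = fn
        self.args = list(args)
        self.join = sum(a is EMPTY for a in self.args)  # join counter


def spawn(fn, *args):
    """Create a closure; it is queued at once only if no argument is missing."""
    c = Closure(fn, args)
    if c.join == 0:
        ready.append(c)
    return c


def send_argument(k, value):
    """Fill the slot named by continuation k = (closure, slot index)."""
    closure, slot = k
    closure.args[slot] = value
    closure.join -= 1
    if closure.join == 0:
        ready.append(closure)


def fib(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        s = spawn(sum_thread, k, EMPTY, EMPTY)  # spawn_next sum(k, ?x, ?y)
        spawn(fib, (s, 1), n - 1)               # spawn fib(x, n - 1)
        spawn(fib, (s, 2), n - 2)               # spawn fib(y, n - 2)


def sum_thread(k, x, y):
    send_argument(k, x + y)


def run_fib(n):
    result = []
    root = Closure(result.append, [EMPTY])  # receives the final value
    spawn(fib, (root, 0), n)
    while ready:  # every thread runs to completion; none ever blocks
        c = ready.pop()
        c.fn(*c.args)
    return result[0]
```

Calling run_fib(10) returns 55. No thread in the emulation ever waits: fib either sends its value or creates closures and returns, and sum_thread only becomes ready once both of its missing arguments have arrived.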
Nonblocking Threads: Advantages
Shallow call stack. (for us: fault tolerance )
Simplify runtime system:
Completed threads leave C runtime stack empty.
Portable runtime implementation
Nonblocking Threads: Disadvantages
Burdens the programmer with explicit continuation passing.
Work-Stealing Scheduler
The concept of work-stealing goes at least as far back as 1981.
Work-stealing:
– a process with no work selects a victim from which to get work.
– it gets the shallowest thread in the victim’s spawn tree.
In Cilk, thieves choose victims randomly.
Thread Level
Stealing Work: The Ready Deque
Each closure has a level:
– level( child ) = level( parent ) + 1
– level( successor ) = level( parent )
Each processor maintains a ready deque:
– Contains ready closures
– The Lth element contains the list of all ready closures whose level is L.
Ready deque
if ( ! readyDeque.isEmpty() )
    take deepest thread
else
    steal shallowest thread from readyDeque of randomly selected victim
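The scheduling rule above can be sketched as a level-indexed ready deque plus one scheduling decision per idle processor. The class and function names below are hypothetical, closures are represented by arbitrary Python values, and the last-in-first-out order within a level is an assumption of this sketch rather than something the slides specify.

```python
import random


class ReadyDeque:
    """Ready closures grouped by level: element L lists the closures at level L."""

    def __init__(self):
        self.levels = []

    def push(self, level, closure):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append(closure)

    def is_empty(self):
        return not any(self.levels)

    def pop_deepest(self):
        """Used by the owner: take a closure from the deepest nonempty level."""
        for level in range(len(self.levels) - 1, -1, -1):
            if self.levels[level]:
                return level, self.levels[level].pop()
        raise IndexError("ready deque is empty")

    def pop_shallowest(self):
        """Used by a thief: take a closure from the shallowest nonempty level."""
        for level, bucket in enumerate(self.levels):
            if bucket:
                return level, bucket.pop()
        raise IndexError("ready deque is empty")


def next_closure(me, deques, rng=random):
    """One scheduling decision for processor `me`; None means a failed steal."""
    if not deques[me].is_empty():
        return deques[me].pop_deepest()
    victim = rng.choice([p for p in range(len(deques)) if p != me])
    if deques[victim].is_empty():
        return None
    return deques[victim].pop_shallowest()
```

With two processors, an idle processor 1 always picks processor 0 as its victim and takes that deque's shallowest closure, while processor 0 keeps working on its own deepest closures.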
Why Steal Shallowest closure?
Shallow threads probably produce more work, so fewer steals are needed, which reduces communication.
Shallow threads are more likely to be on the critical path.
Readying a Remote Closure
If a send_argument makes a remote closure ready, the closure is put on the sending processor’s readyDeque.
This costs extra communication.
– Done to make the scheduler provably good
– Putting it on the readyDeque of the processor where the closure resides works well in practice.
Performance of Applications
Tserial = time for the serial C program
T1 = time for the Cilk program on 1 processor
Tserial / T1 = efficiency of the Cilk program
– Efficiency is close to 1 for programs with moderately long threads: Cilk overhead is small.
Performance of Applications
T1 / TP = speedup
T1 / T∞ = average parallelism
If average parallelism is large relative to P, then speedup is nearly perfect.
If average parallelism is small, then speedup is much smaller.
Performance Data
Performance of Applications
Application speedup = efficiency X speedup
= ( Tserial / T1 ) X ( T1 / TP )
= Tserial / TP
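The identity above is easy to check numerically. In the sketch below the function names and the timings (90, 100, and 12.5 time units) are hypothetical and are not measurements from the performance data.

```python
def efficiency(t_serial, t1):
    """Tserial / T1: how close the 1-processor Cilk run is to the C program."""
    return t_serial / t1


def speedup(t1, tp):
    """T1 / TP: speedup of the Cilk program over its own 1-processor run."""
    return t1 / tp


def application_speedup(t_serial, t1, tp):
    """Efficiency times speedup, which reduces to Tserial / TP."""
    return efficiency(t_serial, t1) * speedup(t1, tp)
```

With Tserial = 90, T1 = 100, and TP = 12.5, the efficiency is 0.9, the speedup is 8, and the application speedup is 7.2, the same as Tserial / TP.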
Modeling Performance
TP >= max( T∞ , T1 / P )
A good scheduler should come
close to these lower bounds.
Modeling Performance
Empirical data suggests that for Cilk:
TP ≈ c1 ( T1 / P ) + c∞ T∞ ,
where c1 ≈ 1.067 & c∞ ≈ 1.042
If T1 / T∞ > 10P
then the critical path has almost no effect on TP.
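A small sketch of this empirical model follows. The two constants are the fitted values quoted above; the function names and the sample inputs are hypothetical.

```python
C1 = 1.067     # fitted coefficient of the work term T1 / P
C_INF = 1.042  # fitted coefficient of the critical-path term


def predicted_tp(t1, t_inf, p, c1=C1, c_inf=C_INF):
    """Empirical model: TP is roughly c1 * (T1 / P) + c_inf * T-infinity."""
    return c1 * t1 / p + c_inf * t_inf


def lower_bound(t1, t_inf, p):
    """TP can never be below max(T-infinity, T1 / P)."""
    return max(t_inf, t1 / p)


def critical_path_share(t1, t_inf, p, c1=C1, c_inf=C_INF):
    """Fraction of the predicted TP that comes from the critical-path term."""
    return c_inf * t_inf / predicted_tp(t1, t_inf, p, c1, c_inf)
```

When T1 / T∞ is at least 10P, the critical-path term is under 9% of the predicted time with these constants, which is why the work term dominates in that regime.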
Proven Property: Time
Time: Including scheduling overhead, the expected execution time is
TP = O( T1/P + T∞ ),
which is asymptotically optimal
Conclusions
We can predict the performance of a Cilk program by observing machine-independent characteristics:
– Work
– Critical path
when the program is fully strict.
Cilk’s usefulness is unclear for other kinds of programs (e.g., iterative programs).
Conclusions ...
Explicit continuation passing is a nuisance.
It was subsequently removed (with more clever preprocessing).
Conclusions ...
Great system research has a theoretical underpinning.
Such research identifies important properties
– of the systems themselves, or
– of our ability to reason about them formally.
Cilk identified 3 significant system properties:
– Fully strict programs
– Non-blocking threads
– Randomly choosing a victim.
END
The Cost of Spawns
A spawn is about an order of magnitude more
costly than a C function call.
Spawned threads running on parent’s processor
can be implemented more efficiently than
remote spawns.
– This usually is the case.
Compiler techniques can exploit this distinction.
Communication Efficiency
A request is an attempt to steal work
(the victim may not have work).
Requests/processor & steals/processor
both grow as the critical path grows.
Proven Properties: Space
In a fully strict program, each thread sends arguments only to its parent’s successors.
For such programs, space, time, & communication bounds are proven.
Space: SP <= S1 P.
– There exists a P-processor execution for which this is asymptotically optimal.
Proven Properties: Communication
Communication: The expected # of bits communicated in a P-processor execution is:
O( T∞ P SMAX )
where SMAX denotes the size of the program’s largest closure.
There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where k > c T∞ P SMAX, for some constant, c.