Processor-oblivious parallel algorithms and scheduling
Illustration on parallel prefix
Jean-Louis Roch, Daouda Traore
INRIA-CNRS Moais team - LIG Grenoble, France
Contents
I. What is a processor-oblivious parallel algorithm?
II. Work-stealing scheduling of parallel algorithms
III. Processor-oblivious parallel prefix computation
Workshop “Scheduling Algorithms for New Emerging Applications” - CIRM Luminy - May 29th-June 2nd, 2006
The problem
Problem: compute f(a)
Dynamic architecture: non-fixed number of resources, variable speeds
e.g. grid, but not only: SMP server in multi-user mode
[Figure: candidate algorithms (sequential, parallel P=2, parallel P=100, ..., parallel P=max) facing the possible targets (multi-user SMP server, grid, heterogeneous network)]
Which algorithm to choose?
Processor-oblivious algorithms
A dynamic architecture motivates a “processor-oblivious” parallel algorithm that:
+ is independent of the underlying architecture: no reference to p, nor to $\Pi_i(t)$ = speed of processor i at time t, nor …
+ on a given architecture, has performance guarantees: behaves as well as an optimal (off-line, non-oblivious) one
Problem: often, the larger the parallel degree, the larger the number of operations to perform!
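Prefix computation
• Prefix problem:
  • input: $a_0, a_1, \dots, a_n$
  • output: $\pi_0, \pi_1, \dots, \pi_n$ with $\pi_i = a_0 * a_1 * \dots * a_i$
• Sequential algorithm: π[0] = a[0]; for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i];
  performs $W_1 = W_\infty = n$ operations
• Fine-grain optimal parallel algorithm [Ladner-Fischer]:
  critical time $W_\infty = 2 \log n$, but performs $W_1 = 2n$ operations: twice as expensive as the sequential algorithm
[Figure: one level of the parallel recursion. Adjacent inputs $a_0, \dots, a_n$ are multiplied pairwise, a prefix of size n/2 yields $\pi_1, \pi_3, \dots, \pi_n$, and one more round of products yields $\pi_2, \pi_4, \dots, \pi_{n-1}$]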
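The operation counts quoted above can be checked on a small sketch. The code below is an illustration, not the authors' implementation: `sequential_prefix` is the plain loop, and `recursive_prefix` follows the pairwise recursion suggested by the slide's figure (it is not claimed to be the exact Ladner-Fischer construction). Both function names are hypothetical, and the input here has n elements (indices 0 to n-1) rather than n+1.

```python
import operator

def sequential_prefix(a, op=operator.mul):
    """Plain left-to-right loop. Returns (prefixes, number of op applications)."""
    out = [a[0]]
    ops = 0
    for x in a[1:]:
        out.append(op(out[-1], x))
        ops += 1
    return out, ops

def recursive_prefix(a, op=operator.mul):
    """Pairwise recursion: the products inside each step are independent,
    so each step could run in parallel. Returns (prefixes, op count)."""
    n = len(a)
    if n == 1:
        return [a[0]], 0
    # Step 1: combine adjacent pairs (n // 2 independent products).
    pairs = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    ops = n // 2
    # Step 2: solve the half-size prefix problem on the pairs.
    sub, sub_ops = recursive_prefix(pairs, op)
    ops += sub_ops
    # Step 3: odd positions are read directly from sub,
    # even positions need one extra (independent) product each.
    out = [None] * n
    out[0] = a[0]
    for i in range(1, n):
        if i % 2 == 1:
            out[i] = sub[i // 2]
        else:
            out[i] = op(sub[i // 2 - 1], a[i])
            ops += 1
    return out, ops

if __name__ == "__main__":
    data = list(range(1, 17))
    seq, seq_ops = sequential_prefix(data)
    par, par_ops = recursive_prefix(data)
    print(seq == par, seq_ops, par_ops)
```

For a power-of-two input size the recursive version performs close to twice as many operations as the loop, which is the cost of parallelism the slide points at.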
Prefix computation: an example where parallelism always costs
• Any parallel algorithm with critical time $W_\infty$ runs on p processors in time [formula missing from the extracted slide]
  - strict lower bound: block algorithm + pipeline [Nicolau&al. 1996]
- Question: how to design a generic parallel algorithm, independent of the architecture, that achieves optimal performance on any given architecture?
  -> design a malleable algorithm whose scheduling adapts the number of operations performed to the architecture
Architecture model
- Heterogeneous processors with changing speeds [Bender-Rabin02]
  => $\Pi_i(t)$ = instantaneous speed of processor i at time t, in operations per second
- Average speed per processor for a computation with duration T: [formula missing from the extracted slide]
- Lower bound for the time of prefix computation: [formula missing from the extracted slide]
Work-stealing (1/2)
“Work”: $W_1$ = total number of operations performed
“Depth”: $W_\infty$ = number of operations on a critical path (parallel time on $\infty$ resources)
• Work-stealing = “greedy” schedule, but distributed and randomized
• Each processor manages locally the tasks it creates
• When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)
Work-stealing (2/2)
• Interests:
  -> suited to heterogeneous architectures with a slight modification [Bender-Rabin02]
  -> if $W_\infty$ is small enough, near-optimal processor-oblivious schedule, with good probability, on p processors with average speed $\Pi_{ave}$
  NB: number of successful steals = number of task migrations < $p \cdot W_\infty$ [Blumofe 98, Narlikar 01, Bender 02]
• Implementation: work-first principle [Cilk series-parallel, Kaapi dataflow]
  -> Move the scheduling overhead onto the steal operations (infrequent case)
  -> Common case: “local parallelism” implemented by a sequential function call
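The stealing rule described above can be illustrated with a toy simulation. This is a sketch under strong assumptions, not Cilk or Kaapi code: processors advance in synchronous rounds, a steal is free, every processor has the same speed, and `work_stealing_sim` and its parameters are hypothetical names. Each processor owns a deque, takes its newest task locally, and an idle processor takes the oldest task of one randomly chosen victim.

```python
import random
from collections import deque

def work_stealing_sim(n_items, p, grain=1, seed=0):
    """Sum range(n_items) on p simulated processors.
    Returns (total, rounds, successful_steals)."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].append((0, n_items))        # all the work starts on processor 0
    total = rounds = steals = 0
    while any(deques):                    # some deque still holds a task
        rounds += 1
        for me in range(p):
            dq = deques[me]
            if not dq:
                # Idle: one steal attempt on a random victim per round.
                victim = rng.randrange(p)
                if victim == me or not deques[victim]:
                    continue              # failed attempt, stay idle this round
                dq.append(deques[victim].popleft())   # take the OLDEST task
                steals += 1
            lo, hi = dq.pop()             # owner works on its NEWEST task
            while hi - lo > grain:
                # Local "fork": keep the first half, expose the second half.
                mid = (lo + hi) // 2
                dq.append((mid, hi))
                hi = mid
            total += sum(range(lo, hi))   # execute one leaf task
    return total, rounds, steals

if __name__ == "__main__":
    print(work_stealing_sim(1000, 4))
```

Because the oldest task in a deque is also the largest one in this fork pattern, a successful steal moves a big piece of work, which is why steals stay infrequent compared with local task creation.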
How to get both optimal work $W_1$ and small $W_\infty$?
• General approach: mix both
  • a sequential algorithm with optimal work $W_1$
  • and a fine-grain parallel algorithm with minimal critical time $W_\infty$
• Folk technique: parallel, then sequential
  • Parallel algorithm down to a certain “grain”; then use the sequential one
  • Drawback: $W_\infty$ increases ;o) … and so does the number of steals
• Work-preserving speed-up technique [Bini-Pan94]: sequential, then parallel
  Cascading [Jaja92]: careful interplay of both algorithms to build one with both $W_\infty$ small and $W_1 = O(W_{seq})$
  • Use the work-optimal sequential algorithm to reduce the size
  • Then use the time-optimal parallel algorithm to decrease the time
  Drawback: sequential at coarse grain and parallel at fine grain ;o(
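As a point of comparison, here is a minimal sketch of a fixed-grain prefix in the “parallel, then sequential” style: the input is cut into p blocks in advance and each block is processed by the sequential loop. It is one common way to instantiate that idea, written here as sequential Python that only marks which phases could run in parallel; `blocked_prefix` is a hypothetical name. Note that the code refers to p explicitly, which is exactly what a processor-oblivious algorithm avoids.

```python
import operator
from itertools import accumulate

def blocked_prefix(a, p, op=operator.mul):
    """Three-phase prefix over at most p blocks. Returns (prefixes, op count)."""
    n = len(a)
    bounds = [(k * n // p, (k + 1) * n // p) for k in range(p)]
    bounds = [(lo, hi) for lo, hi in bounds if hi > lo]   # drop empty blocks
    # Phase 1 (blocks are independent): sequential local prefix in each block.
    local = [list(accumulate(a[lo:hi], op)) for lo, hi in bounds]
    ops = sum(len(block) - 1 for block in local)
    # Phase 2 (small problem): prefix over the block totals.
    carry = list(accumulate((block[-1] for block in local), op))
    ops += len(local) - 1
    # Phase 3 (blocks are independent): combine each block after the first
    # with the total of everything that precedes it.
    out = list(local[0])
    for k in range(1, len(local)):
        out.extend(op(carry[k - 1], x) for x in local[k])
        ops += len(local[k])
    return out, ops

if __name__ == "__main__":
    data = list(range(1, 17))
    result, ops = blocked_prefix(data, 4)
    print(result == list(accumulate(data, operator.mul)), ops)
```

With 16 elements and 4 blocks this performs 27 operations against 15 for the sequential loop, and the grain (block size) is frozen before execution, so it cannot react to a processor that slows down or disappears.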
Alternative: concurrently sequential and parallel
Based on the work-first principle: always execute a sequential algorithm, to reduce the parallelism overhead; use the parallel algorithm only if a processor becomes idle (i.e. steals), by extracting parallelism from the sequential computation
Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
- Self-adaptive granularity based on work-stealing
[Figure: SeqCompute runs sequentially; on a steal, Extract_par splits off a LastPartComputation, which the thief runs in parallel as its own SeqCompute]
Adaptive Prefix on 3 processors
[Figure, animated over six frames, with a Sequential part and a Parallel part: the main sequential process (Main Seq.) starts computing the prefixes of $a_1, \dots, a_{12}$ from $\pi_0$. Work-stealer 1 and work-stealer 2 send steal requests; work-stealer 1 works on $a_5 \dots a_8$ and computes the local products $a_5 * \dots * a_i$, work-stealer 2 works on $a_9 \dots a_{12}$ and computes the local products $a_9 * \dots * a_i$. When the main process reaches stolen work it preempts the work-stealer and jumps ahead, obtaining $\pi_8$ and then $\pi_{11}$, while the remaining prefixes are completed in parallel from the local products.]
Implicit critical path on the sequential process
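The coupling illustrated above can be sketched as a small round-based simulation. This is an assumption-laden illustration, not the authors' Kaapi implementation: all processors have unit speed, steals are free, the victim is the interval with the most remaining work instead of a random one, and every name (`adaptive_prefix`, the job dictionaries) is hypothetical. The main process computes true prefixes left to right; a thief takes the last half of a remaining interval and computes local products; when the main process reaches a stolen interval it preempts it, jumps ahead with one product, and leaves the skipped positions to be finalized in parallel.

```python
import operator
from itertools import accumulate

def adaptive_prefix(a, p, op=operator.mul):
    """One main sequential process plus (p - 1) work-stealers, one op per
    processor per round. Returns (prefixes, op count, rounds). Needs len(a) >= 1."""
    n = len(a)
    out = [None] * n                  # final prefixes
    loc = [None] * n                  # local products inside stolen intervals
    out[0] = a[0]
    main = {"kind": "main", "cur": 1, "end": n}
    stolen = []                       # local jobs not yet collected by main
    finals = []                       # jobs doing out[i] = out[base] * loc[i]
    workers = [None] * (p - 1)        # current job of each work-stealer
    ops = rounds = 0

    def remaining(job):
        return job["end"] - job["cur"]

    def find_work():
        for job in finals:            # prefer a finalization nobody runs
            if not any(w is job for w in workers):
                return job
        victim = max([main] + [w for w in workers if w is not None],
                     key=remaining)
        if remaining(victim) < 2:
            return None
        mid = victim["cur"] + remaining(victim) // 2
        end, victim["end"] = victim["end"], mid      # thief takes [mid, end)
        if victim["kind"] == "final":
            job = {"kind": "final", "cur": mid, "end": end,
                   "base": victim["base"]}
            finals.append(job)
        else:
            loc[mid] = a[mid]         # first local product is a copy: no op
            job = {"kind": "local", "start": mid, "cur": mid + 1, "end": end}
            stolen.append(job)
            stolen.sort(key=lambda j: j["start"])
        return job

    while remaining(main) > 0 or stolen or finals:
        rounds += 1
        for w in range(len(workers)):                # idle workers look for work
            if workers[w] is None:
                workers[w] = find_work()
        if remaining(main) > 0:                      # sequential step
            c = main["cur"]
            out[c] = op(out[c - 1], a[c])
            main["cur"] += 1
            ops += 1
        elif stolen:                                 # preempt next interval, jump
            seg = stolen.pop(0)
            workers[:] = [None if w is seg else w for w in workers]
            k, base = seg["cur"], seg["start"] - 1
            out[k - 1] = op(out[base], loc[k - 1])
            ops += 1
            if k - 1 > seg["start"]:
                finals.append({"kind": "final", "cur": seg["start"],
                               "end": k - 1, "base": base})
            main["cur"], main["end"] = k, seg["end"]
        elif len(workers) < p:
            workers.append(None)      # main reached the end: it becomes a stealer
        for i, job in enumerate(workers):            # work-stealer steps
            if job is None:
                continue
            if remaining(job) > 0:
                c = job["cur"]
                if job["kind"] == "local":
                    loc[c] = op(loc[c - 1], a[c])
                else:
                    out[c] = op(out[job["base"]], loc[c])
                job["cur"] += 1
                ops += 1
            if remaining(job) == 0:
                workers[i] = None
                if job["kind"] == "final":
                    finals.remove(job)
    return out, ops, rounds

if __name__ == "__main__":
    data = list(range(1, 14))
    result, ops, rounds = adaptive_prefix(data, 3)
    print(result == list(accumulate(data, operator.mul)), ops, rounds)
```

In this sketch each position costs one operation when the main process computes it directly and at most two when it was stolen, so the total stays between n-1 and 2(n-1) operations, and with p = 1 the code degenerates to the plain sequential loop.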
Analysis of the algorithm
• Execution time: [formula missing from the extracted slide; one of its terms is labelled “Lower bound”]
• Sketch of the proof:
  - Dynamic coupling of two algorithms that complete simultaneously:
    - Sequential: (optimal) number of operations S on one processor
    - Parallel: minimal time, but performs X operations on the other processors
      • dynamic splitting is always possible down to the finest grain, BUT local execution is sequential
      - Critical path is small (e.g. log X)
      - Each non-constant-time task can potentially be split (variable speeds)
  - The algorithmic scheme ensures $T_s = T_p + O(\log X)$
    => this makes it possible to bound the total number X of operations performed, and the overhead of parallelism = (S + X) - #ops_optimal
Adaptive prefix: experiments 1
Single-user context: processor-oblivious prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically “optimal” off-line parallel algorithm on p processors
[Figure: prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context. Time (s) versus #processors for three curves: pure sequential, optimal off-line on p procs, oblivious]
Adaptive prefix: experiments 2
Multi-user context: additional external load: (9-p) additional external dummy processes are executed concurrently
Processor-oblivious prefix computation is always the fastest: 15% benefit over a parallel algorithm for p processors with an off-line schedule
[Figure: prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context with external load (9-p external processes). Time (s) versus #processors for two curves: off-line parallel algorithm for p processors, oblivious]
Conclusion
The interplay between an on-line parallel algorithm and the work-stealing schedule that directs it is useful for the design of processor-oblivious algorithms
Application to prefix computation:
- theoretically, reaches the lower bound on heterogeneous processors with changing speeds
- practically, achieves near-optimal performance on multi-user SMPs
Generic adaptive scheme to implement parallel algorithms with provable performance
- work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint]
Thank you!
Interactive Distributed Simulation [B. Raffin & E. Boyer]
- 5 cameras, 6 PCs
- 3D reconstruction + simulation + rendering
-> Adaptive scheme to maximize 3D-reconstruction precision within a fixed time step
The Prefix race: sequential / parallel fixed / adaptive
Race between 9 algorithms (44 processes) on an octo-SMP
[Figure: horizontal bar chart of execution time (seconds, axis from 0 to 25) for the 9 competitors: Sequential, Parallel 2 proc., Parallel 3 proc., Parallel 4 proc., Parallel 5 proc., Parallel 6 proc., Parallel 7 proc., Parallel 8 proc., Adaptive 8 proc.]
On each of the 10 executions, adaptive completes first
Adaptive prefix: some experiments
Prefix of 10000 elements on an 8-processor SMP (IA64 / Linux), with * = double sum ( r[i] = r[i-1] + x[i] )
Single-user context: adaptive is equivalent to:
- sequential on 1 proc
- optimal parallel-2 proc. on 2 processors
- …
- optimal parallel-8 proc. on 8 processors
Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm
[Figure: two panels of time (s) versus #processors comparing Parallel and Adaptive: “Single user” and “Processors with variable speeds” (external load)]
Remark for n = 4,096,000 doubles:
- “pure” sequential: 0.20 s
- minimal “grain” = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to the lower bound)
Finest “grain” limited to 1 page = 16384 bytes = 2048 doubles