
Mini Symposium “Adaptive Algorithms for Scientific Computing”. 9h45: Adaptive algorithms - Theory and applications. Jean-Louis Roch et al., AHA Team INRIA-CNRS.


Page 1:
Page 2:

Mini Symposium

Adaptive Algorithms for Scientific Computing

• 9h45 Adaptive algorithms - Theory and applications. Jean-Louis Roch et al., AHA Team INRIA-CNRS, Grenoble, France

• 10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA

• 10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, Gudula Rünger, U. Bayreuth, Germany

• 11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA

• Adaptive, hybrid, oblivious: what do those terms mean ?
• Taxonomy of autonomic computing [Ganek & Corbi 2003]:

– Self-configuring / self-healing / self-optimising / self-protecting

• Objective: towards an analysis based on the algorithm performance

Page 3:

Adaptive algorithms: Theory and applications

Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram

IMAG-INRIA Workgroup on “Adaptive and Hybrid Algorithms” Grenoble, France

Contents

I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

Page 4:

Why adaptive algorithms and how?

⎡ 7 3 6 ⎤
⎢ 0 1 8 ⎥
⎣ 0 0 5 ⎦

Input data vary. Resource availability is versatile.

Adaptation to improve performance

Scheduling:
• partitioning
• load-balancing
• work-stealing

Measures on resources

Measures on data

Calibration:
• tuning parameters: block size w.r.t. cache, choice of instructions, …
• priority management

Choices in the algorithm:
• sequential / parallel(s)
• approximate / exact
• in-memory / out-of-core
• …

An algorithm is « hybrid » iff there is a high-level choice between at least two algorithms, each of which could solve the same problem.

Page 5:

Modeling a hybrid algorithm

• Several algorithms to solve the same problem f:
– e.g. algo_f1, algo_f2(block size), …, algo_fk
– each algo_fi being recursive

Adaptation: choose algo_fj for each call to f

algo_fi(n, …) {
  …
  f(n-1, …);
  …
  f(n/2, …);
  …
}

E.g. “practical” hybrids:
• Atlas, Goto, FFPack
• FFTW
• cache-oblivious B-trees
• any parallel program with scheduling support: Cilk, Athapascan/Kaapi, Nesl, TLib, …
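As a minimal sketch of the recursive hybrid scheme above (the algorithm names, threshold, and workload, a plain sum, are hypothetical stand-ins, not the libraries just cited), a hybrid exposes one entry point f whose choice is re-made at every recursive call:

```python
THRESHOLD = 32  # assumed tuning parameter (a "simple hybrid" has O(1) such choices)

def f(n):
    """Hybrid entry point: choose an algorithm for each call to f."""
    if n <= THRESHOLD:
        return algo_f1(n)      # e.g. a sequential base-case algorithm
    return algo_f2(n)          # e.g. a recursive blocked algorithm

def algo_f1(n):
    # Base case: direct iterative computation (here: sum of 0..n-1).
    return sum(range(n))

def algo_f2(n):
    # Recursive case: split the problem and recurse through f, so the
    # choice is re-made at every level (cf. algo_fi calling f(n/2, ...)).
    half = n // 2
    return f(half) + sum(range(half, n))

assert f(1000) == sum(range(1000))  # both branches compute the same function
```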

Page 6:

• How to manage the overhead due to choices ?
• Classification 1/2:

– Simple hybrid iff O(1) choices [e.g. block size in Atlas, …]

– Baroque hybrid iff an unbounded number of choices

[e.g. recursive splitting factors in FFTW]

• Choices are either dynamic or pre-computed, based on input properties.
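A baroque hybrid pre-computes one choice per recursive problem size. A sketch in the spirit of FFTW's splitting factors (the cost model, candidate radices, and function name below are invented for illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def best_plan(n):
    """Pre-compute (cost, split) for size n under a toy cost model."""
    if n <= 2:
        return (n, None)                    # base case: solve directly
    candidates = []
    for q in (2, 3, 5):                     # hypothetical candidate radices
        if n % q == 0:
            # recurse: q subproblems of size n/q, plus ~n recombination work
            candidates.append((q * best_plan(n // q)[0] + n, q))
    if not candidates:
        return (n * n, None)                # no candidate divides n: fall back
    return min(candidates)                  # one choice per size => unbounded total

cost, split = best_plan(60)   # choices for sizes 60, 30, 20, 12, ... are memoized
```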

Page 7:

• Choices may or may not be based on architecture parameters.

• Classification 2/2:

A hybrid is:
– Oblivious: the control flow depends neither on static properties of the resources nor on the input

[e.g. cache-oblivious algorithms [Bender]]

– Tuned: strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity, …]

• Engineered-tuned or self-tuned [e.g. ATLAS and GOTO libraries, FFTW, …] [e.g. LinBox/FFLAS, Saunders et al.]

– Adaptive: self-configuration of the algorithm, dynamic
• based on input properties or resource circumstances discovered at run-time

[e.g. idle processors, data properties, …] [e.g. TLib, Rauber & Rünger]

Page 8:

Examples

• BLAS libraries
– Atlas: simple tuned (self-tuned)
– Goto: simple engineered (engineered-tuned)
– LinBox / FFLAS: simple self-tuned, adaptive [Saunders et al.]

• FFTW
– halving factor: baroque tuned
– stopping criterion: simple tuned

• Parallel algorithms and scheduling:
– choice of the parallel degree: e.g. TLib [Rauber & Rünger]
– work-stealing schedule: baroque hybrid

Page 9:

Adaptive algorithms: Theory and applications

Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier,Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram

INRIA-CNRS Project on“Adaptive and Hybrid Algorithms” Grenoble, France

Contents

I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

Page 10:

Work-stealing (1/2)

« Depth »
W∞ = #ops on a critical path
(parallel time on ∞ resources)

• Work-stealing = “greedy” schedule, but distributed and randomized

• Each processor locally manages the tasks it creates
• When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)

« Work »
W1 = #total operations performed

Page 11:

Work-stealing (2/2)

« Depth »
W∞ = #ops on a critical path
(parallel time on ∞ resources)

« Work »
W1 = #total operations performed

• Interests:
-> suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02]

-> with good probability, near-optimal schedule on p processors with average speed πave:
Tp < W1/(p·πave) + O(W∞/πave)

NB: #successful steals = #task migrations = O(p·W∞) [Blumofe 98, Narlikar 01, Bender 02]

• Implementation: work-first principle [Cilk, Kaapi]
• Local parallelism is implemented by sequential function calls
• Restrictions to ensure validity of the default sequential schedule:
- series-parallel / Cilk
- reference order / Kaapi

Page 12:

Work-stealing and adaptability
• Work-stealing ensures the allocation of processors to tasks transparently to the application, with provable performance
• Supports addition of new resources
• Supports resilience of resources and fault tolerance (crash faults, network, …)

• Checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …]

• “Baroque hybrid” adaptation: there is an (implicit) dynamic choice between two algorithms:
• a sequential (local) algorithm: depth-first (default choice)
• a parallel algorithm: breadth-first
• The choice is performed at runtime, depending on resource idleness

• Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]:
• parallel divide & conquer computations
• tree searching, branch & X, …
-> suited when the sequential and parallel algorithms perform (almost) the same number of operations

Page 13:

• Solution: mix a sequential and a parallel algorithm

• Basic technique:
• run the parallel algorithm down to a certain « grain », then use the sequential one
• Problem: W∞ increases, and so do the number of migrations … and the inefficiency ;o(

• Work-preserving speed-up [Bini-Pan 94] = cascading [Jaja 92]:
careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
• Divide the sequential algorithm into blocks
• Each block is computed with the (non-optimal) parallel algorithm
• Drawback: sequential at coarse grain and parallel at fine grain ;o(

• Adaptive granularity, the dual approach:
• parallelism is extracted at run-time from any sequential task
But parallelism often has a cost !

Page 14:

Self-adaptive grain algorithm, based on the work-first principle:
always execute a sequential algorithm, to reduce parallelism overhead
=> use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation

Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm

– Examples: iterated product [Vernizzi 05], gzip / compression [Kerfali 04], MPEG-4 / H264 [Bernard 06], prefix computation [Traore 06]

[Diagram: SeqCompute runs; Extract_par hands over the LastPartComputation, while SeqCompute continues on the remaining part.]
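A single-threaded simulation of this scheme (the steal-request trigger, the plain-sum workload, and the in-line simulation of the thief by a recursive call are all assumptions for illustration):

```python
def seq_compute(data, start, end, steal_requested):
    """SeqCompute over data[start:end], with an extractable last part."""
    total, i = 0, start
    while i < end:
        if steal_requested() and end - i > 1:
            # LastPartComputation: hand the second half of the *remaining*
            # range [i, end) to the thief (simulated here by recursion).
            mid = (i + end) // 2
            total += seq_compute(data, mid, end, steal_requested)
            end = mid
        total += data[i]        # one step of the sequential algorithm
        i += 1
    return total

data = list(range(100))
state = {"n": 0}
def every_tenth():              # hypothetical steal-request trigger
    state["n"] += 1
    return state["n"] % 10 == 0

# Every element is processed exactly once, steals or not.
assert seq_compute(data, 0, len(data), every_tenth) == sum(data)
```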

Page 15:

Adaptive algorithms: Theory and applications

Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier,Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram

INRIA-CNRS Project on“Adaptive and Hybrid Algorithms” Grenoble, France

Contents

I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

Page 16:

• Sequential algorithm: for (i = 0; i <= n; i++) P[ i ] = P[ i - 1 ] * a[ i ];

• Parallel algorithm [Ladner-Fischer]:

Prefix computation : an example where parallelism always costs

π1 = a0*a1, π2 = a0*a1*a2, …, πn = a0*a1*…*an

W∞ = 2 log n, but W1 = 2n: twice as expensive as the sequential algorithm …

[Figure: pairwise products a0*a1, a2*a3, …, an-1*an; a recursive Prefix of size n/2 on the pairs; one more product per remaining position yields the other prefixes.]

Sequential algorithm: W1 = W∞ = n
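The two operation counts can be checked by implementing both algorithms and counting products (a sketch that assumes n is a power of two; the one-element list used as an op counter is just a convenience):

```python
def seq_prefix(a, ops):
    """Sequential prefix: n-1 products."""
    p = [a[0]]
    for x in a[1:]:
        p.append(p[-1] * x)
        ops[0] += 1
    return p

def par_prefix(a, ops):
    """Recursive [Ladner-Fischer]-style prefix, n a power of two assumed."""
    if len(a) == 1:
        return [a[0]]
    pairs = []
    for i in range(0, len(a), 2):          # n/2 pairwise products
        pairs.append(a[i] * a[i + 1])
        ops[0] += 1
    half = par_prefix(pairs, ops)          # prefix of size n/2
    out = [a[0]]
    for i in range(1, len(a)):
        if i % 2 == 1:
            out.append(half[i // 2])            # already a full prefix
        else:
            out.append(half[i // 2 - 1] * a[i])  # one extra product
            ops[0] += 1
    return out

ops_s, ops_p = [0], [0]
a = list(range(1, 17))                     # n = 16
assert seq_prefix(a, ops_s) == par_prefix(a, ops_p)
assert ops_s[0] == 15                      # sequential: n - 1 products
assert ops_p[0] == 26                      # parallel: about 2n products
```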

Page 17:

Adaptive prefix computation

– Any (parallel) prefix computation performs at least W1 ≥ 2n − W∞ operations

– Strict lower bound on p identical processors: Tp ≥ 2n/(p+1), reached by a block algorithm + pipeline [Nicolau et al. 2000]

Application of the adaptive scheme:
– One process performs the main “sequential” computation
– The other, work-stealing, processes compute parallel « segmented » prefixes

– Near-optimal performance on processors with changing speeds:
Tp < 2n/((p+1)·πave) + O(log n / πave), close to the lower bound

Page 18:

Scheme of the proof
• Dynamic coupling of two algorithms that complete simultaneously:
– Sequential: (optimal) number of operations S
– Parallel: performs X operations

• Dynamic splitting is always possible down to the finest grain, BUT stays locally sequential
• Scheduled by work-stealing on p−1 processors:
– small critical path (log X)
– every non-constant-time task can be split (variable speeds)

• Analysis:
• the algorithmic scheme ensures Ts = Tp + O(log X)
=> enables bounding the total number X of operations performed and the overhead of parallelism = (S + X) − #ops_optimal
• comparison to the lower bound on the number of operations

Page 19:

Adaptive Prefix on 3 processors

[Animation frame: the main sequential process starts on a1 … a12 and has computed π1; work-stealer 1 sends a steal request.]

Page 20:

Adaptive Prefix on 3 processors

[Animation frame: the main sequential process continues on a1 … a4; work-stealer 1 computes the segmented prefix πi = a5*…*ai on a5 … a8; work-stealer 2 sends a steal request and receives a9 … a12.]

Page 21:

Adaptive Prefix on 3 processors

[Animation frame: work-stealer 2 computes the segmented prefix πi = a9*…*ai; having finished π4, the main process preempts work-stealer 1 and finalizes π5 … π8 from π4 and the segmented values.]

Page 22:

Adaptive Prefix on 3 processors

[Animation frame: π8 known; the main process preempts work-stealer 2 and starts finalizing π9 … π11 from π8 and the segmented values.]

Page 23:

Adaptive Prefix on 3 processors

[Animation frame: finalization continues; only the last element a12 remains.]

Page 24:

Adaptive Prefix on 3 processors

[Animation frame: final state of the execution.]

Implicit critical path on the sequential process

Page 25:

Adaptive prefix: some experiments

Single-user context: Adaptive is equivalent to:
- sequential on 1 processor
- optimal parallel-2 proc. on 2 processors
- …
- optimal parallel-8 proc. on 8 processors

Multi-user context: Adaptive is the fastest, with a 15% benefit over a static-grain algorithm

[Two plots: execution time (s) vs #processors, without and with external charge, comparing Parallel and Adaptive; prefix of 10000 elements on an SMP with 8 procs (IA64 / Linux).]

Joint work with Daouda Traore

Page 26:

The Prefix race: sequential / parallel fixed / adaptive

Race between 9 algorithms (44 processes) on an octo-SMP: Sequential, Parallel on 2 to 8 proc., Adaptive on 8 proc.

[Bar chart: execution time (seconds), one bar per algorithm, scale 0-25 s.]

On each of the 10 executions, adaptive completes first

Page 27:

With * = double sum (r[i] = r[i-1] + x[i])

Single user, processors with variable speeds

Remark, for n = 4,096,000 doubles:
- “pure” sequential: 0.20 s
- minimal “grain” = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to the lower bound)

Finest “grain” limited to 1 page = 16384 bytes = 2048 doubles

Page 28:

E.g. Triangular system solving: A·x = b, with A lower triangular

• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)

A·x = b, with A lower triangular:

1/ x1 = b1 / a11

2/ For k = 2..n: bk = bk − ak1·x1

=> reduces the system of dimension n to a system of dimension n−1
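Unrolling the two steps above into a loop gives classical forward substitution; a runnable sketch (dense row-major lists and a nonzero diagonal assumed), performing about n²/2 multiply-adds:

```python
def solve_lower_triangular(A, b):
    """Solve A.x = b for lower-triangular A by forward substitution."""
    n = len(b)
    b = list(b)                       # work on a copy of the right-hand side
    x = []
    for k in range(n):
        xk = b[k] / A[k][k]           # step 1: x_k = b_k / a_kk
        x.append(xk)
        for i in range(k + 1, n):     # step 2: b_i -= a_ik * x_k
            b[i] -= A[i][k] * xk      # leaves a system of dimension n-k-1
    return x

A = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 1.0, 2.0]]
b = [2.0, 4.0, 7.0]
x = solve_lower_triangular(A, b)      # here A.x = b gives x = [1, 1, 1]
```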

Page 29:

E.g. Triangular system solving: A·x = b, with A lower triangular

• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)

• Using parallel matrix inversion: T1 = n³; T∞ = log² n (fine grain)

A = ⎡ A11  0  ⎤    A⁻¹ = ⎡ A11⁻¹   0   ⎤    with S = −A22⁻¹·A21·A11⁻¹
    ⎣ A21 A22 ⎦          ⎣  S    A22⁻¹ ⎦    and x = A⁻¹·b

• Self-adaptive granularity algorithm: T1 = n²; T∞ = n·log n

[Diagram: the solve combines ExtractPar with a self-adaptive scalar product, a self-adaptive sequential algorithm, a self-adaptive matrix inversion, and a choice of the block height h.]

Page 30:

Conclusion

Adaptive: what choices, and how to choose ?
Illustration: adaptive parallel prefix based on work-stealing
- self-tuned baroque hybrid: O(p·log n) choices
- achieves near-optimal, processor-oblivious performance

A generic adaptive scheme to implement parallel algorithms with provable performance

Page 31:

Mini Symposium

Adaptive Algorithms for Scientific Computing

• 9h45 Adaptive algorithms - Theory and applications. Jean-Louis Roch et al., AHA Team INRIA-CNRS, Grenoble, France

• 10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA

• 10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, U. Bayreuth, Germany

• 11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA

• Adaptive, hybrid, oblivious: what do those terms mean ?
• Taxonomy of autonomic computing [Ganek & Corbi 2003]:

– Self-configuring / self-healing / self-optimising / self-protecting

• Objective: towards an analysis based on the algorithm performance

Page 32:

Questions ?

Page 33:

Some examples (1/2)
• Adaptive algorithms used empirically and theoretically:

– Atlas [2001], dense linear algebra library:
• instruction set and instruction schedule
• self-calibration of the block size at installation on the machine

– FFTW (1998, …): FFT(n) <= p FFT(q) and q FFT(p), with n = p·q
• for any n, for any recursive call FFT(n): pre-compute the best value for p
• pre-computation, on the machine, of the optimal splitting for the vector size n

– Cache-oblivious B-trees:
• recursive block splitting to minimize #page faults
• self-adaptation to the memory hierarchy

– Work-stealing (Cilk (1998, …), (2000, …)): recursive parallelism
• choice between a sequential depth-first schedule and a breadth-first schedule
• « work-first principle »: optimize the local sequential execution and put the overhead on the rare steals from idle processors
• implicitly adaptive

Page 34:

– Moldable tasks: bi-criteria scheduling with guarantees [Trystram et al. 2004]
• recursive alternating combination of approximations for each criterion
• self-adaptation with guaranteed performance for each criterion

– « Cache-oblivious » algorithms [Bender et al. 2004]
• recursive block splitting that minimizes page faults
• self-adaptation to the memory hierarchy (B-tree)

– « Processor-oblivious » algorithms [Roch et al. 2005]
• recursive combination of 2 algorithms, one sequential and one parallel
• self-adaptation to resource idleness

Some examples (2/2)

Page 35:

Best case: the parallel algorithm is efficient
W∞ is small and W1 = Wseq
The parallel algorithm is an optimal sequential one. Examples: parallel D&C algorithms

Implementation: work-first principle
- no overhead for local execution of tasks
Examples: Cilk: the THE protocol; Kaapi: compare-and-swap only

Page 36:

Experimentation: knary benchmark

SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
Distributed architecture: iCluster, Athapascan

#procs   Speed-up
8        7.83
16       15.6
32       30.9
64       59.2
100      90.1

Ts = 2397 s, T1 = 2435 s

Page 37:

In « practice »: coarse granularity. Splitting into p = #resources.
Drawback on heterogeneous, dynamic architectures: πi(t) = speed of processor i at time t.

In « theory »: fine granularity. Maximal parallelism.
Drawback: overhead of task management.

How to choose/adapt granularity ?

[Task-graph figure: inputs a and b feed tasks F(2,a), G(a,b), H(a), H(b), O(b,7); high potential degree of parallelism.]

Page 38:

How to obtain an efficient fine-grain algorithm ?

• Hypothesis for efficiency of work-stealing:
• the parallel algorithm is « work-optimal »
• T∞ is very small (recursive parallelism)

• Problem:
• fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
• overhead due to parallelism creation and synchronization
• but also arithmetic overhead

Page 39:

Self-adaptive grain algorithms

• Recursive computations
– local sequential computation
• Special case:
– recursive extraction of parallelism when a resource becomes idle
– but local execution of a sequential algorithm
• Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm

• Examples: iterated product [Vernizzi], gzip / compression [Kerfali], MPEG-4 / H264 [Bernard], prefix computation [Traore]

Page 40:

Adaptive Prefix versus optimal on identical processors

Page 41:

Illustration: adaptive parallel prefix

• Adaptive parallel computing on non-uniform and shared resources

• Example of adaptive prefix computation

Page 42:

• Sequential algorithm: for (i = 0; i <= n; i++) P[ i ] = P[ i - 1 ] * a[ i ]; W1 = n

• Parallel algorithm [Ladner-Fischer]:

Indeed parallelism often costs …
e.g. prefix computation: P1 = a0*a1, P2 = a0*a1*a2, …, Pn = a0*a1*…*an

[Figure: pairwise products a0*a1, a2*a3, …, an-1*an; a recursive Prefix(n/2) on the pairs; one more product per remaining position yields the other prefixes P1, P3, …, Pn-1.]

W∞ = 2 log n, but W1 = 2n: twice as expensive as the sequential algorithm