CSE 160/Berman
Programming Paradigms and Algorithms
W+A 3.1, 3.2, p. 178, 5.1, 5.3.3, Chapter 6, 9.2.8, 10.4.1,
Kumar 12.1.3
1. Berman, F., Wolski, R., Figueira, S., Schopf, J. and Shao, G., "Application-Level Scheduling on Distributed Heterogeneous Networks,"
Proceedings of Supercomputing '96
(http://apples.ucsd.edu)
Common Parallel Programming Paradigms
• Embarrassingly parallel programs
• Workqueue
• Master/Slave programs
• Monte Carlo methods
• Regular, Iterative (Stencil) Computations
• Pipelined Computations
• Synchronous Computations
Regular, Iterative (Stencil) Applications
• Many scientific applications have the form:
  Loop until some condition is true
    Perform a computation which involves communicating with the N, E, W, S neighbors of a point (5-point stencil)
    [Convergence test?]
Stencil Example: Jacobi2D
• The Jacobi algorithm, also known as the method of simultaneous corrections, is an iterative method for approximating the solution to a system of linear equations.
• Jacobi addresses the problem of solving n linear equations in n unknowns, Ax = b, where the ith equation is

  $a_{i,1} x_1 + a_{i,2} x_2 + \cdots + a_{i,n} x_n = b_i$

  or alternatively

  $x_i = \frac{1}{a_{i,i}} \Bigl( b_i - \sum_{j \neq i} a_{i,j} x_j \Bigr)$

• The a's and b's are known; we want to solve for the x's.
Jacobi 2D Strategy
• The Jacobi strategy iterates until the computation converges, i.e. at each iteration we solve

  $x_i^{(k)} = \frac{1}{a_{i,i}} \Bigl( b_i - \sum_{j \neq i} a_{i,j} x_j^{(k-1)} \Bigr)$

  where the values from the (k-1)st iteration are used to compute the values for the kth iteration.
• For important classes of problems, Jacobi converges to a "good" solution after O(log N) iterations [Leighton]
  – typically, the solution is approximated to a desired error threshold
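The iteration above can be sketched in a few lines of plain Python. This is a minimal illustration, not production code; the matrix A, vector b, iteration cap, and tolerance below are all illustrative choices (the system is diagonally dominant, so Jacobi converges).

```python
# Minimal sketch of the Jacobi iteration for Ax = b.
def jacobi(A, b, iters=100, tol=1e-10):
    n = len(b)
    x = b[:]                      # common initial guess: x_i = b_i
    for _ in range(iters):
        # x_i^(k) = (b_i - sum_{j != i} a_ij * x_j^(k-1)) / a_ii
        new_x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(new_x[i] - x[i]) for i in range(n)) < tol:
            return new_x
        x = new_x
    return x

# Illustrative diagonally dominant system: 2x + y = 5, x + 3y = 10.
A = [[2.0, 1.0], [1.0, 3.0]]
b = [5.0, 10.0]
x = jacobi(A, b)
```

Note that every `new_x[i]` is computed from the previous iterate `x`, never from values already updated in this sweep; that is exactly the "simultaneous corrections" property that later makes the method easy to parallelize.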
Jacobi 2D
• The equation

  $x_i^{(k)} = \frac{1}{a_{i,i}} \Bigl( b_i - \sum_{j \neq i} a_{i,j} x_j^{(k-1)} \Bigr)$

  is most efficient to solve when most a's are 0.
• When most entries of A are non-zero, A is dense.
• When most a's are 0, A is sparse.
  – Sparse matrices are regularly found in many scientific applications.

  $A = \begin{pmatrix} a_{1,1} & \cdots & a_{1,n} \\ \vdots & & \vdots \\ a_{n,1} & \cdots & a_{n,n} \end{pmatrix}$
Laplace's Equation
• The Jacobi strategy can be used effectively to solve sparse linear equations.
• One such equation is Laplace's equation:

  $\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = 0$

• f is solved over a 2D space having coordinates x and y.
• If the distance between points ($\Delta$) is small enough, f can be approximated by

  $\frac{\partial^2 f}{\partial x^2} \approx \frac{f(x+\Delta,y) - 2f(x,y) + f(x-\Delta,y)}{\Delta^2}$

  $\frac{\partial^2 f}{\partial y^2} \approx \frac{f(x,y+\Delta) - 2f(x,y) + f(x,y-\Delta)}{\Delta^2}$

• These equations reduce to

  $f(x,y) = \frac{f(x+\Delta,y) + f(x-\Delta,y) + f(x,y+\Delta) + f(x,y-\Delta)}{4}$
Laplace's Equation
• Note the relationship between the parameters:

  $f(x,y) = \frac{f(x+\Delta,y) + f(x-\Delta,y) + f(x,y+\Delta) + f(x,y-\Delta)}{4}$

• This forms a 4-point stencil: the update of (x,y) uses only its neighbors (x+Δ,y), (x-Δ,y), (x,y+Δ), (x,y-Δ).
• Any update will involve only local communication!
Solving Laplace using the Jacobi strategy
• Note that in the Laplace equation, we want to solve for all f(x,y), which has 2 parameters.
• In Jacobi, we want to solve for x_i, which has only 1 index.
• How do we convert f(x,y) into x_i?
• Associate the x_i's with the f(x,y)'s by distributing them in the f 2D matrix in row-major (natural) order:

  $\begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \end{pmatrix}$

• For an n×n matrix, there are then n×n x_i's, so the A matrix will need to be (n×n)×(n×n).
Solving Laplace using the Jacobi strategy
• When the x_i's are distributed in the f 2D matrix in row-major (natural) order

  $\begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \end{pmatrix}$

  the update

  $f(x,y) = \frac{f(x+\Delta,y) + f(x-\Delta,y) + f(x,y+\Delta) + f(x,y-\Delta)}{4}$

  becomes

  $x_i = \frac{x_{i-n} + x_{i-1} + x_{i+1} + x_{i+n}}{4}$
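The row-major correspondence between grid coordinates and unknown indices can be checked with a tiny sketch (0-based indices here for convenience; `to_index` and `n` are illustrative names). It confirms that the four grid neighbors of a point become exactly x_{i-n}, x_{i-1}, x_{i+1}, x_{i+n}:

```python
# Map grid point (row, col) of an n x n grid to the unknown index i
# in row-major (natural) order, and inspect the neighbor indices.
n = 3

def to_index(row, col):
    # row-major order: i = row * n + col  (0-based)
    return row * n + col

# Interior point (1, 1) of the 3x3 grid:
i = to_index(1, 1)
north = to_index(0, 1)   # should be i - n
west  = to_index(1, 0)   # should be i - 1
east  = to_index(1, 2)   # should be i + 1
south = to_index(2, 1)   # should be i + n
```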
Working backward
• Now we want to work backward to find out what the A matrix and b vector will be for Jacobi.
• Our solution to the Laplace equation gives us equations of this form:

  $x_i = \frac{x_{i-n} + x_{i-1} + x_{i+1} + x_{i+n}}{4}$

• Rewriting, we get

  $4x_i - x_{i-n} - x_{i-1} - x_{i+1} - x_{i+n} = 0$

• So the b_i are 0. What is the A matrix?
Finding the A matrix
• Each row has at most 5 non-zero entries.
• All entries on the diagonal are 4; the non-zero off-diagonal entries are -1.

  $A \begin{pmatrix} x_1 \\ \vdots \\ x_{n^2} \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$

N = 9, n = 3:

  A =
   4 -1  0 -1  0  0  0  0  0
  -1  4 -1  0 -1  0  0  0  0
   0 -1  4 -1  0 -1  0  0  0
  -1  0 -1  4 -1  0 -1  0  0
   0 -1  0 -1  4 -1  0 -1  0
   0  0 -1  0 -1  4 -1  0 -1
   0  0  0 -1  0 -1  4 -1  0
   0  0  0  0 -1  0 -1  4 -1
   0  0  0  0  0 -1  0 -1  4
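The banded structure can be generated programmatically. This sketch follows the slide's N = 9 example, placing -1 at offsets ±1 and ±n from the diagonal and ignoring row-boundary effects, as the slide does; `laplace_matrix` is an illustrative name:

```python
# Build the banded A matrix for N = n*n unknowns: 4 on the diagonal,
# -1 at offsets +/-1 and +/-n (row-edge effects ignored, as in the slide).
def laplace_matrix(n):
    N = n * n
    A = [[0] * N for _ in range(N)]
    for i in range(N):
        A[i][i] = 4
        for off in (-1, 1, -n, n):
            j = i + off
            if 0 <= j < N:
                A[i][j] = -1
    return A

A = laplace_matrix(3)
```

Each row of the result has at most 5 non-zero entries, which is what makes A sparse and the per-point update cheap.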
Jacobi Implementation Strategy
• An initial guess is made for all the unknowns, typically x_i = b_i.
• New values for the x_i's are calculated using the iteration equations:

  $x_i^{t+1} = \frac{x_{i-n}^{t} + x_{i-1}^{t} + x_{i+1}^{t} + x_{i+n}^{t}}{4}$

• The updated values are substituted in the iteration equations and the process repeats again.
• The user provides a "termination condition" to end the iteration.
  – An example termination condition is an error threshold: stop when $x_i^{t+1} \approx x_i^{t}$ for all i.
Data Parallel Jacobi 2D Pseudo-code

[Initialize ghost regions]
for (i=1; i<=N; i++) {
    x[0][i] = north[i];
    x[N+1][i] = south[i];
    x[i][0] = west[i];
    x[i][N+1] = east[i];
}
[Initialize matrix]
for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
        x[i][j] = initvalue;
[Iterative refinement of x until values converge]
while (maxdiff > CONVERG) {
    [Update x array]
    for (i=1; i<=N; i++)
        for (j=1; j<=N; j++)
            newx[i][j] = ¼ (x[i-1][j] + x[i][j+1] + x[i+1][j] + x[i][j-1]);
    [Convergence test]
    maxdiff = 0;
    for (i=1; i<=N; i++)
        for (j=1; j<=N; j++) {
            maxdiff = max(maxdiff, |newx[i][j] - x[i][j]|);
            x[i][j] = newx[i][j];
        }
}
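The pseudo-code above can be turned into a runnable sequential sketch. The grid size, boundary values, and convergence threshold below are illustrative, and the ghost-region exchange is replaced by fixed boundary rows (north boundary set to 1.0, the rest to 0.0):

```python
# Sequential sketch of the Jacobi 2D iteration from the pseudo-code.
N = 8            # interior points per dimension (illustrative)
CONVERG = 1e-4   # convergence threshold (illustrative)

# (N+2) x (N+2) array: row/column 0 and N+1 hold fixed boundary values.
x = [[0.0] * (N + 2) for _ in range(N + 2)]
for i in range(1, N + 1):
    x[0][i] = 1.0        # "north" boundary; the other boundaries stay 0.0

maxdiff = CONVERG + 1.0
while maxdiff > CONVERG:
    # Update step: 4-neighbor average into a fresh array (simultaneous update).
    newx = [row[:] for row in x]
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            newx[i][j] = 0.25 * (x[i-1][j] + x[i][j+1] + x[i+1][j] + x[i][j-1])
    # Convergence test: largest change over the interior points.
    maxdiff = max(abs(newx[i][j] - x[i][j])
                  for i in range(1, N + 1) for j in range(1, N + 1))
    x = newx
```

In the parallel version each processor would own a sub-block of `x`, and the rows/columns read from a neighbor's block are what the ghost regions hold.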
Jacobi2D Programming Issues
• Synchronization
  – Should we synchronize between iterations? Between multiple iterations?
  – Should we tag information and let the application run asynchronously? (How bad can things get?)
• How often should we test for convergence?
  – How important is it to know when we're done?
  – How expensive is it?
Jacobi2D Programming Issues
• Block decomposition or strip decomposition?
  – How big should the blocks or strips be?
• How should blocks/strips be allocated to processors?

[Figure: Block, Uniform Strip, and Non-uniform Strip decompositions]
HPF-Style Data Decompositions
• 1D (Processors P0 P1 P2 P3, tasks 0-15)
  – Block decomposition (task i allocated to processor floor(i / (n/p)) for n tasks on p processors)
  – Cyclic decomposition (task i allocated to processor i mod p)
  – Block-cyclic decomposition (block i allocated to processor i mod p)

[Figure: Block, Cyclic, and Block-cyclic layouts of tasks 0-15 on 4 processors]
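The three 1D mappings can be written as one-line functions. This is a sketch with illustrative names, using 0-based task and processor numbers and assuming p divides n evenly:

```python
# The three HPF-style 1D task-to-processor mappings.
def block(i, n, p):
    # contiguous blocks of size n/p: task i -> floor(i / (n/p))
    return i // (n // p)

def cyclic(i, p):
    # round-robin: task i -> i mod p
    return i % p

def block_cyclic(i, p, blocksize):
    # blocks of `blocksize` tasks dealt out round-robin: block number mod p
    return (i // blocksize) % p

# Slide's example: 16 tasks on 4 processors.
block_owners  = [block(i, 16, 4) for i in range(16)]
cyclic_owners = [cyclic(i, 4) for i in range(16)]
bc_owners     = [block_cyclic(i, 4, 2) for i in range(16)]
```

Block keeps neighbors together (good for stencil locality); cyclic spreads load; block-cyclic trades between the two via the block size.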
HPF-Style Data Decompositions
• 2D
  – Each dimension partitioned by block, cyclic, block-cyclic, or * (do nothing)
  – A useful set of uniform decompositions can be constructed

[Figure: [Block, Block], [Block, *], and [*, Cyclic] decompositions]
Jacobi on a Cluster
• If each partition of Jacobi is executed on a processor in a lab cluster, we can no longer assume we have dedicated processors and network
• In particular, the performance exhibited by the cluster will vary over time and with load
• How can we go about developing a performance-efficient implementation in a more dynamic environment?
Jacobi AppLeS
• We developed an AppLeS application scheduler for Jacobi 2D.
• AppLeS = Application-Level Scheduler.
• An AppLeS is a scheduling agent that integrates with the application to form a "Grid-aware", adaptive, self-scheduling application.
• We targeted the Jacobi AppLeS to a distributed, clustered environment.
How Does AppLeS Work?

[Diagram: the AppLeS scheduling pipeline. Resource Discovery identifies accessible resources; Resource Selection narrows them to feasible resource sets; Schedule Planning and Performance Modeling produces evaluated schedules; the Decision Model picks the "best" schedule, which Schedule Deployment launches on the resources. Dynamic state comes from the Grid infrastructure via the NWS. AppLeS + application = self-scheduling application.]
Network Weather Service (Wolski, U. Tenn.)
• The NWS provides dynamic resource information for AppLeS.
• The NWS is a stand-alone system.
• The NWS
  – monitors the current system state
  – provides the best forecast of resource load from multiple models

[Diagram: NWS architecture — a Sensor Interface feeds measurements to multiple forecasting models (Model 1, Model 2, Model 3); a Forecaster chooses among them and exposes predictions through a Reporting Interface]
[Figure: Fast Ethernet bandwidth at SDSC, in megabits per second (0-70), over a week (Tue-Tue), plotting NWS measurements against exponential-smoothing predictions]
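The exponential-smoothing predictions shown in the bandwidth plot can be sketched in a few lines. This is an illustration of the technique only: `alpha` and the sample series are made up, and the real NWS selects dynamically among several forecasting models rather than using a single fixed one:

```python
# Exponential smoothing: the next forecast is a weighted blend of the
# latest observation and the previous forecast.
def smooth_forecasts(measurements, alpha=0.5):
    forecasts = [measurements[0]]          # seed with the first observation
    for m in measurements[1:]:
        prev = forecasts[-1]
        forecasts.append(alpha * m + (1 - alpha) * prev)
    return forecasts

# Illustrative bandwidth samples in Mbits/s:
f = smooth_forecasts([40.0, 60.0, 50.0], alpha=0.5)
```

A larger `alpha` tracks recent measurements more aggressively; a smaller one damps transient spikes like the contention visible in the figure.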
Jacobi2D AppLeS Resource Selector
• Feasible resources determined according to an application-specific "distance" metric
  – Choose the fastest machine as the locus
  – Compute distance D from the locus based on a unit-sized, application-specific benchmark:
    D[locus, X] = |comp[unit, locus] - comp[unit, X]| + comm[W, E columns]
• Resources sorted according to distance from the locus, forming a desirability list
  – Feasible resource sets formed from initial subsets of the sorted desirability list
  – Next step: plan a schedule for each feasible resource set
  – The scheduler will choose the schedule with the best predicted execution time
Jacobi2D Performance Model and Schedule Planning
• Execution time for the ith strip:

  $T_i = \mathrm{Area}_i \times \frac{\mathrm{Comp}_i(\text{unloaded})}{\mathrm{load}_i} + \mathrm{Comm}_i$

  where load = predicted percentage of CPU time available (NWS)
        comm = time to send and receive messages, factored by predicted bandwidth (NWS)
• AppLeS uses time-balancing to determine the best partition on a given set of resources:
  solve for $\{\mathrm{Area}_i\}$ such that $T_1 = T_2 = \cdots = T_p$

[Figure: strips of different widths assigned to processors P1, P2, P3]
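Under simplifying assumptions, time balancing has a closed form. The sketch below ignores communication cost and folds the load prediction into an effective per-processor speed (area units per second), so equalizing T_i = Area_i / speed_i with a fixed total area makes each Area_i proportional to speed_i. All names and numbers are illustrative, not the actual AppLeS code:

```python
# Time-balancing sketch: split `total_area` so every processor finishes
# at the same time, given effective speeds (load-adjusted, comm ignored).
def balance(total_area, speeds):
    s = sum(speeds)
    return [total_area * sp / s for sp in speeds]

# Illustrative: 1200 rows over three processors of relative speed 1:2:3.
areas = balance(1200, [1.0, 2.0, 3.0])
```

With communication included, the per-processor times become coupled and AppLeS instead solves the balance equations numerically for each candidate resource set.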
Jacobi2D Experiments
• Experiments compare
  – Compile-time block partitioning [HPF]
  – Compile-time irregular strip partitioning [no NWS forecasts, no resource selection]
  – Run-time AppLeS strip partitioning
• Runs for the different partitioning methods performed back-to-back on production systems
• Average execution time recorded
• Distributed UCSD/SDSC platform: Sparcs, RS6000, Alpha Farm, SP-2
Jacobi2D AppLeS Experiments
• Representative Jacobi 2D AppLeS experiment
• Adaptive scheduling leverages the deliverable performance of a contended system
• Spike occurs when a gateway between the PCL and SDSC goes down
• Subsequent AppLeS experiments avoid the slow link

[Figure: Comparison of execution times (0-7 seconds) across problem sizes 1000-2000 for compile-time blocked, compile-time irregular strip, and runtime AppLeS partitioning]