CSE 160/Berman
Programming Paradigms and Algorithms
W+A 3.1, 3.2, p. 178, 5.1, 5.3.3, Chapter 6, 9.2.8, 10.4.1,
Kumar 12.1.3
1. Berman, F., Wolski, R., Figueira, S., Schopf, J. and Shao, G., "Application-Level Scheduling on Distributed Heterogeneous Networks,"
Proceedings of Supercomputing '96
(http://apples.ucsd.edu)
Common Parallel Programming Paradigms
• Embarrassingly parallel programs
• Workqueue
• Master/Slave programs
• Monte Carlo methods
• Regular, Iterative (Stencil) Computations
• Pipelined Computations
• Synchronous Computations
Regular, Iterative (Stencil) Applications
• Many scientific applications have the form:
  Loop until some condition is true
    Perform a computation which involves communicating with the N, E, W, S neighbors of a point (5-point stencil)
    [Convergence test?]
Stencil Example: Jacobi2D
• The Jacobi algorithm, also known as the method of simultaneous corrections, is an iterative method for approximating the solution to a system of linear equations.
• Jacobi addresses the problem of solving n linear equations in n unknowns, Ax = b, where the ith equation is

  $a_{i,1} x_1 + a_{i,2} x_2 + \cdots + a_{i,n} x_n = b_i$

  or alternatively

  $x_i = \frac{1}{a_{i,i}} \Bigl( b_i - \sum_{j \neq i} a_{i,j} x_j \Bigr)$

• The a's and b's are known; we want to solve for the x's.
Jacobi 2D Strategy
• The Jacobi strategy iterates until the computation converges, i.e. at each iteration we solve

  $x_i^{(k)} = \frac{1}{a_{i,i}} \Bigl( b_i - \sum_{j \neq i} a_{i,j} x_j^{(k-1)} \Bigr)$

  where the values from the (k-1)st iteration are used to compute the values for the kth iteration.
• For important classes of problems, Jacobi converges to a "good" solution after O(log N) iterations [Leighton]
  – typically, the solution is approximated to a desired error threshold
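The iteration above can be sketched in a few lines of plain Python. This is a minimal illustration, not production code; the matrix A, vector b, iteration cap, and tolerance below are all illustrative choices (the system is diagonally dominant, so Jacobi converges).

```python
# Minimal sketch of the Jacobi iteration for Ax = b.
def jacobi(A, b, iters=100, tol=1e-10):
    n = len(b)
    x = b[:]                      # common initial guess: x_i = b_i
    for _ in range(iters):
        # x_i^(k) = (b_i - sum_{j != i} a_ij * x_j^(k-1)) / a_ii
        new_x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(new_x[i] - x[i]) for i in range(n)) < tol:
            return new_x
        x = new_x
    return x

# Illustrative diagonally dominant system: 2x + y = 5, x + 3y = 10.
A = [[2.0, 1.0], [1.0, 3.0]]
b = [5.0, 10.0]
x = jacobi(A, b)
```

Note that every `new_x[i]` is computed from the previous iterate `x`, never from values already updated in this sweep; that is exactly the "simultaneous corrections" property that later makes the method easy to parallelize.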
Jacobi 2D
• The equation

  $x_i^{(k)} = \frac{1}{a_{i,i}} \Bigl( b_i - \sum_{j \neq i} a_{i,j} x_j^{(k-1)} \Bigr)$

  is most efficient to solve when most a's are 0.
• When most entries of A are non-zero, A is dense.
• When most a's are 0, A is sparse.
  – Sparse matrices are regularly found in many scientific applications.

  $A = \begin{pmatrix} a_{1,1} & \cdots & a_{1,n} \\ \vdots & & \vdots \\ a_{n,1} & \cdots & a_{n,n} \end{pmatrix}$
Laplace's Equation
• The Jacobi strategy can be used effectively to solve sparse linear equations.
• One such equation is Laplace's equation:

  $\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = 0$

• f is solved over a 2D space having coordinates x and y.
• If the distance between points ($\Delta$) is small enough, f can be approximated by

  $\frac{\partial^2 f}{\partial x^2} \approx \frac{f(x+\Delta,y) - 2f(x,y) + f(x-\Delta,y)}{\Delta^2}$

  $\frac{\partial^2 f}{\partial y^2} \approx \frac{f(x,y+\Delta) - 2f(x,y) + f(x,y-\Delta)}{\Delta^2}$

• These equations reduce to

  $f(x,y) = \frac{f(x+\Delta,y) + f(x-\Delta,y) + f(x,y+\Delta) + f(x,y-\Delta)}{4}$
Laplace's Equation
• Note the relationship between the parameters:

  $f(x,y) = \frac{f(x+\Delta,y) + f(x-\Delta,y) + f(x,y+\Delta) + f(x,y-\Delta)}{4}$

• This forms a 4-point stencil: the update of (x,y) uses only its neighbors (x+Δ,y), (x-Δ,y), (x,y+Δ), (x,y-Δ).
• Any update will involve only local communication!
Solving Laplace using the Jacobi strategy
• Note that in the Laplace equation, we want to solve for all f(x,y), which has 2 parameters.
• In Jacobi, we want to solve for x_i, which has only 1 index.
• How do we convert f(x,y) into x_i?
• Associate the x_i's with the f(x,y)'s by distributing them in the f 2D matrix in row-major (natural) order:

  $\begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \end{pmatrix}$

• For an n×n matrix, there are then n×n x_i's, so the A matrix will need to be (n×n)×(n×n).
Solving Laplace using the Jacobi strategy
• When the x_i's are distributed in the f 2D matrix in row-major (natural) order

  $\begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \end{pmatrix}$

  the update

  $f(x,y) = \frac{f(x+\Delta,y) + f(x-\Delta,y) + f(x,y+\Delta) + f(x,y-\Delta)}{4}$

  becomes

  $x_i = \frac{x_{i-n} + x_{i-1} + x_{i+1} + x_{i+n}}{4}$
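The row-major correspondence between grid coordinates and unknown indices can be checked with a tiny sketch (0-based indices here for convenience; `to_index` and `n` are illustrative names). It confirms that the four grid neighbors of a point become exactly x_{i-n}, x_{i-1}, x_{i+1}, x_{i+n}:

```python
# Map grid point (row, col) of an n x n grid to the unknown index i
# in row-major (natural) order, and inspect the neighbor indices.
n = 3

def to_index(row, col):
    # row-major order: i = row * n + col  (0-based)
    return row * n + col

# Interior point (1, 1) of the 3x3 grid:
i = to_index(1, 1)
north = to_index(0, 1)   # should be i - n
west  = to_index(1, 0)   # should be i - 1
east  = to_index(1, 2)   # should be i + 1
south = to_index(2, 1)   # should be i + n
```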
Working backward
• Now we want to work backward to find out what the A matrix and b vector will be for Jacobi.
• Our solution to the Laplace equation gives us equations of this form:

  $x_i = \frac{x_{i-n} + x_{i-1} + x_{i+1} + x_{i+n}}{4}$

• Rewriting, we get

  $4x_i - x_{i-n} - x_{i-1} - x_{i+1} - x_{i+n} = 0$

• So the b_i are 0. What is the A matrix?
Finding the A matrix
• Each row has at most 5 non-zero entries.
• All entries on the diagonal are 4; the non-zero off-diagonal entries are -1.

  $A \begin{pmatrix} x_1 \\ \vdots \\ x_{n^2} \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$

N = 9, n = 3:

  A =
   4 -1  0 -1  0  0  0  0  0
  -1  4 -1  0 -1  0  0  0  0
   0 -1  4 -1  0 -1  0  0  0
  -1  0 -1  4 -1  0 -1  0  0
   0 -1  0 -1  4 -1  0 -1  0
   0  0 -1  0 -1  4 -1  0 -1
   0  0  0 -1  0 -1  4 -1  0
   0  0  0  0 -1  0 -1  4 -1
   0  0  0  0  0 -1  0 -1  4
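The banded structure can be generated programmatically. This sketch follows the slide's N = 9 example, placing -1 at offsets ±1 and ±n from the diagonal and ignoring row-boundary effects, as the slide does; `laplace_matrix` is an illustrative name:

```python
# Build the banded A matrix for N = n*n unknowns: 4 on the diagonal,
# -1 at offsets +/-1 and +/-n (row-edge effects ignored, as in the slide).
def laplace_matrix(n):
    N = n * n
    A = [[0] * N for _ in range(N)]
    for i in range(N):
        A[i][i] = 4
        for off in (-1, 1, -n, n):
            j = i + off
            if 0 <= j < N:
                A[i][j] = -1
    return A

A = laplace_matrix(3)
```

Each row of the result has at most 5 non-zero entries, which is what makes A sparse and the per-point update cheap.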
Jacobi Implementation Strategy
• An initial guess is made for all the unknowns, typically x_i = b_i.
• New values for the x_i's are calculated using the iteration equations:

  $x_i^{t+1} = \frac{x_{i-n}^{t} + x_{i-1}^{t} + x_{i+1}^{t} + x_{i+n}^{t}}{4}$

• The updated values are substituted in the iteration equations and the process repeats again.
• The user provides a "termination condition" to end the iteration.
  – An example termination condition is an error threshold: stop when $x_i^{t+1} \approx x_i^{t}$ for all i.
Data Parallel Jacobi 2D Pseudo-code

[Initialize ghost regions]
for (i=1; i<=N; i++) {
    x[0][i] = north[i];
    x[N+1][i] = south[i];
    x[i][0] = west[i];
    x[i][N+1] = east[i];
}
[Initialize matrix]
for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
        x[i][j] = initvalue;
[Iterative refinement of x until values converge]
while (maxdiff > CONVERG) {
    [Update x array]
    for (i=1; i<=N; i++)
        for (j=1; j<=N; j++)
            newx[i][j] = ¼ (x[i-1][j] + x[i][j+1] + x[i+1][j] + x[i][j-1]);
    [Convergence test]
    maxdiff = 0;
    for (i=1; i<=N; i++)
        for (j=1; j<=N; j++) {
            maxdiff = max(maxdiff, |newx[i][j] - x[i][j]|);
            x[i][j] = newx[i][j];
        }
}
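The pseudo-code above can be turned into a runnable sequential sketch. The grid size, boundary values, and convergence threshold below are illustrative, and the ghost-region exchange is replaced by fixed boundary rows (north boundary set to 1.0, the rest to 0.0):

```python
# Sequential sketch of the Jacobi 2D iteration from the pseudo-code.
N = 8            # interior points per dimension (illustrative)
CONVERG = 1e-4   # convergence threshold (illustrative)

# (N+2) x (N+2) array: row/column 0 and N+1 hold fixed boundary values.
x = [[0.0] * (N + 2) for _ in range(N + 2)]
for i in range(1, N + 1):
    x[0][i] = 1.0        # "north" boundary; the other boundaries stay 0.0

maxdiff = CONVERG + 1.0
while maxdiff > CONVERG:
    # Update step: 4-neighbor average into a fresh array (simultaneous update).
    newx = [row[:] for row in x]
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            newx[i][j] = 0.25 * (x[i-1][j] + x[i][j+1] + x[i+1][j] + x[i][j-1])
    # Convergence test: largest change over the interior points.
    maxdiff = max(abs(newx[i][j] - x[i][j])
                  for i in range(1, N + 1) for j in range(1, N + 1))
    x = newx
```

In the parallel version each processor would own a sub-block of `x`, and the rows/columns read from a neighbor's block are what the ghost regions hold.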
Jacobi2D Programming Issues
• Synchronization
  – Should we synchronize between iterations? Between multiple iterations?
  – Should we tag information and let the application run asynchronously? (How bad can things get?)
• How often should we test for convergence?
  – How important is it to know when we're done?
  – How expensive is it?
Jacobi2D Programming Issues
• Block decomposition or strip decomposition?
  – How big should the blocks or strips be?
• How should blocks/strips be allocated to processors?

[Figure: Block, Uniform Strip, and Non-uniform Strip decompositions]
HPF-Style Data Decompositions
• 1D (Processors P0 P1 P2 P3, tasks 0-15)
  – Block decomposition (task i allocated to processor floor(i / (n/p)) for n tasks on p processors)
  – Cyclic decomposition (task i allocated to processor i mod p)
  – Block-cyclic decomposition (block i allocated to processor i mod p)

[Figure: Block, Cyclic, and Block-cyclic layouts of tasks 0-15 on 4 processors]
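The three 1D mappings can be written as one-line functions. This is a sketch with illustrative names, using 0-based task and processor numbers and assuming p divides n evenly:

```python
# The three HPF-style 1D task-to-processor mappings.
def block(i, n, p):
    # contiguous blocks of size n/p: task i -> floor(i / (n/p))
    return i // (n // p)

def cyclic(i, p):
    # round-robin: task i -> i mod p
    return i % p

def block_cyclic(i, p, blocksize):
    # blocks of `blocksize` tasks dealt out round-robin: block number mod p
    return (i // blocksize) % p

# Slide's example: 16 tasks on 4 processors.
block_owners  = [block(i, 16, 4) for i in range(16)]
cyclic_owners = [cyclic(i, 4) for i in range(16)]
bc_owners     = [block_cyclic(i, 4, 2) for i in range(16)]
```

Block keeps neighbors together (good for stencil locality); cyclic spreads load; block-cyclic trades between the two via the block size.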
HPF-Style Data Decompositions
• 2D
  – Each dimension partitioned by block, cyclic, block-cyclic, or * (do nothing)
  – A useful set of uniform decompositions can be constructed

[Figure: [Block, Block], [Block, *], and [*, Cyclic] decompositions]
Jacobi on a Cluster
• If each partition of Jacobi is executed on a processor in a lab cluster, we can no longer assume we have dedicated processors and network
• In particular, the performance exhibited by the cluster will vary over time and with load
• How can we go about developing a performance-efficient implementation in a more dynamic environment?
Jacobi AppLeS
• We developed an AppLeS application scheduler for Jacobi 2D.
• AppLeS = Application-Level Scheduler.
• An AppLeS is a scheduling agent that integrates with the application to form a "Grid-aware", adaptive, self-scheduling application.
• We targeted the Jacobi AppLeS to a distributed, clustered environment.
How Does AppLeS Work?

[Diagram: the AppLeS scheduling pipeline. Resource Discovery identifies accessible resources; Resource Selection narrows them to feasible resource sets; Schedule Planning and Performance Modeling produces evaluated schedules; the Decision Model picks the "best" schedule, which Schedule Deployment launches on the resources. Dynamic state comes from the Grid infrastructure via the NWS. AppLeS + application = self-scheduling application.]
Network Weather Service (Wolski, U. Tenn.)
• The NWS provides dynamic resource information for AppLeS.
• The NWS is a stand-alone system.
• The NWS
  – monitors the current system state
  – provides the best forecast of resource load from multiple models

[Diagram: NWS architecture — a Sensor Interface feeds measurements to multiple forecasting models (Model 1, Model 2, Model 3); a Forecaster chooses among them and exposes predictions through a Reporting Interface]
[Figure: Fast Ethernet bandwidth at SDSC, in megabits per second (0-70), over a week (Tue-Tue), plotting NWS measurements against exponential-smoothing predictions]
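The exponential-smoothing predictions shown in the bandwidth plot can be sketched in a few lines. This is an illustration of the technique only: `alpha` and the sample series are made up, and the real NWS selects dynamically among several forecasting models rather than using a single fixed one:

```python
# Exponential smoothing: the next forecast is a weighted blend of the
# latest observation and the previous forecast.
def smooth_forecasts(measurements, alpha=0.5):
    forecasts = [measurements[0]]          # seed with the first observation
    for m in measurements[1:]:
        prev = forecasts[-1]
        forecasts.append(alpha * m + (1 - alpha) * prev)
    return forecasts

# Illustrative bandwidth samples in Mbits/s:
f = smooth_forecasts([40.0, 60.0, 50.0], alpha=0.5)
```

A larger `alpha` tracks recent measurements more aggressively; a smaller one damps transient spikes like the contention visible in the figure.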
Jacobi2D AppLeS Resource Selector
• Feasible resources determined according to an application-specific "distance" metric
  – Choose the fastest machine as the locus
  – Compute distance D from the locus based on a unit-sized, application-specific benchmark:
    D[locus, X] = |comp[unit, locus] - comp[unit, X]| + comm[W, E columns]
• Resources sorted according to distance from the locus, forming a desirability list
  – Feasible resource sets formed from initial subsets of the sorted desirability list
  – Next step: plan a schedule for each feasible resource set
  – The scheduler will choose the schedule with the best predicted execution time
Jacobi2D Performance Model and Schedule Planning
• Execution time for the ith strip:

  $T_i = \mathrm{Area}_i \times \frac{\mathrm{Comp}_i(\text{unloaded})}{\mathrm{load}_i} + \mathrm{Comm}_i$

  where load = predicted percentage of CPU time available (NWS)
        comm = time to send and receive messages, factored by predicted bandwidth (NWS)
• AppLeS uses time-balancing to determine the best partition on a given set of resources:
  solve for $\{\mathrm{Area}_i\}$ such that $T_1 = T_2 = \cdots = T_p$

[Figure: strips of different widths assigned to processors P1, P2, P3]
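Under simplifying assumptions, time balancing has a closed form. The sketch below ignores communication cost and folds the load prediction into an effective per-processor speed (area units per second), so equalizing T_i = Area_i / speed_i with a fixed total area makes each Area_i proportional to speed_i. All names and numbers are illustrative, not the actual AppLeS code:

```python
# Time-balancing sketch: split `total_area` so every processor finishes
# at the same time, given effective speeds (load-adjusted, comm ignored).
def balance(total_area, speeds):
    s = sum(speeds)
    return [total_area * sp / s for sp in speeds]

# Illustrative: 1200 rows over three processors of relative speed 1:2:3.
areas = balance(1200, [1.0, 2.0, 3.0])
```

With communication included, the per-processor times become coupled and AppLeS instead solves the balance equations numerically for each candidate resource set.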
Jacobi2D Experiments
• Experiments compare
  – Compile-time block partitioning [HPF]
  – Compile-time irregular strip partitioning [no NWS forecasts, no resource selection]
  – Run-time AppLeS strip partitioning
• Runs for the different partitioning methods performed back-to-back on production systems
• Average execution time recorded
• Distributed UCSD/SDSC platform: Sparcs, RS6000, Alpha Farm, SP-2
Jacobi2D AppLeS Experiments
• Representative Jacobi 2D AppLeS experiment
• Adaptive scheduling leverages the deliverable performance of a contended system
• Spike occurs when a gateway between the PCL and SDSC goes down
• Subsequent AppLeS experiments avoid the slow link

[Figure: Comparison of execution times (0-7 seconds) across problem sizes 1000-2000 for compile-time blocked, compile-time irregular strip, and runtime AppLeS partitioning]