Parallelism
Students:
Deaconescu Ionut
Albu Alexandru
Why Do We Need Parallelism?
Faster, of course
Finish the work earlier: the same work in less time
Do more work: more work in the same time
Source: https://computing.llnl.gov/tutorials/parallel_comp/, http://www.mcs.anl.gov/~itf/dbpp/text/node9.html
How to Parallelize an Application?
Break down the computational part into small pieces
Assign the small jobs to the parallel running processes
This may become complicated when some of the small jobs depend on others
Easy Case: Parameter Set
You are running experiments to support your claims and/or to better understand a problem
An experiment here means an application whose results interest you when it is run with different input parameters
The pieces of computation are the same program with different parameters
Each piece is independent from each other
Source: https://computing.llnl.gov/tutorials/parallel_comp/, http://www.mcs.anl.gov/~itf/dbpp/text/node9.html
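A parameter set can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_experiment` is a hypothetical stand-in for the real application, the parameter values are made up, and a thread pool is used to keep the sketch portable. For CPU-bound experiments a process pool or separate machines would be needed to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(params):
    """Hypothetical stand-in for one run of the application."""
    alpha, steps = params
    value = 0.0
    for k in range(1, steps + 1):
        value += alpha / (k * k)
    return params, value

# Hypothetical parameter set: every combination is an independent piece of work.
parameter_set = [(alpha, steps) for alpha in (0.5, 1.0, 2.0) for steps in (10, 100)]

def run_parameter_set(parameter_set, workers=4):
    """Run the same program once per parameter combination and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_experiment, parameter_set))

results = run_parameter_set(parameter_set)
```

Because no run depends on another, the order in which the pool schedules them does not affect the results.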
Parameter Set using Scripts
Your experiment should be able to run in batch mode
Read all parameters (and other inputs) from the command line and files
Write all output to a file (whose name you can specify as an input)
Use ssh to start the experiment on many machines
If there is no common file system, use scp to stage the inputs and collect the results
Use nice to lower the priority of your jobs on shared machines
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Parameter Set via TDG Cluster
A simple script that uses ssh to start experiments on many machines will save you a lot of time
However, it is possible to do better by carefully considering resource selection, work distribution, input staging, output collection, and the like
That is, a scheduler such as PBS can really help in this scenario
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Hard Case: Dependent Pieces of Computation
If you are running one huge simulation, the pieces of computation are no longer independent
The processes that form the application will have to communicate these dependencies
Source: https://computing.llnl.gov/tutorials/parallel_comp/, http://www.buyya.com/cluster/v2chap1.pdf
Hard Case: Dependent Pieces of Computation
Think about how to break the application apart into processes that run in parallel
Consider carefully whether parallelizing your application is really worth the effort
Parallelize it only if your application really takes too long to run and is going to be used many times
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Programming Alternatives
Shared Memory
  Does not scale that well
Message Passing
  Sockets
    Too low-level
    Usually parallel applications are not client-server
  MPI (Message Passing Interface) is the standard API to do this
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Steps for Writing Parallel Program
If you are starting with an existing serial program, debug the serial code completely
Identify which parts of the program can be executed concurrently:
  Requires a thorough understanding of the algorithm
  Exploit any parallelism which may exist
  May require restructuring of the program and/or algorithm
  May require an entirely new algorithm
Decompose the program:
  Functional Parallelism
  Data Parallelism
  Combination of both
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Steps for Writing Parallel Program
Code development
  Code may be influenced/determined by machine architecture
  Choose a programming paradigm
  Determine communication
  Add code to accomplish process control and communications
Compile, Test, Debug
Optimization
  Measure Performance
  Locate Problem Areas
  Improve them
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Program Decomposition
There are three methods for decomposing a problem into smaller processes to be performed in parallel: Functional Decomposition, Domain Decomposition, or a combination of both
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Functional Decomposition (Functional Parallelism)
Decomposing the problem into different processes which can be distributed to multiple processors for simultaneous execution
Good to use when there is no static structure or fixed determination of the number of calculations to be performed
Source: https://computing.llnl.gov/tutorials/parallel_comp/, http://www.buyya.com/cluster/v2chap1.pdf
Functional Decomposition (Functional Parallelism)
[Figure: "The Problem" divided among Machine 1, Machine 2, Machine 3 and Machine 4]
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Domain Decomposition (Data Parallelism)
Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution
Good to use for problems where:
  data is static (factoring and solving large matrices or finite difference calculations)
  a dynamic data structure is tied to a single entity where the entity can be subsetted (large multi-body problems)
  the domain is fixed but computation within various regions of the domain is dynamic (fluid vortices models)
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Domain Decomposition (Data Parallelism)
[Figure: "The Problem" divided among Machine 1, Machine 2, Machine 3 and Machine 4]
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Other Decomposition Methods – One Dimensional Data Distribution
Block Distribution Cyclic Distribution
Source: https://computing.llnl.gov/tutorials/parallel_comp/
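The difference between the block and cyclic distributions named on this slide can be shown with a small ownership calculation. This is a sketch under assumptions: the function names and the sizes n = 8, p = 4 are illustrative only, and 0-based indices are used.

```python
def block_owner(i, n, p):
    """Owner of element i when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling division: number of elements per block
    return i // block

def cyclic_owner(i, p):
    """Owner of element i when elements are dealt out round-robin."""
    return i % p

n, p = 8, 4
block = [block_owner(i, n, p) for i in range(n)]
cyclic = [cyclic_owner(i, p) for i in range(n)]
print("block :", block)
print("cyclic:", cyclic)
```

With a block distribution each processor owns neighboring elements; with a cyclic distribution consecutive elements go to different processors.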
Other Decomposition Methods – Two Dimensional Data Distribution
Block Block Distribution
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Other Decomposition Methods – Two Dimensional Data Distribution
Block Cyclic Distribution
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Other Decomposition Methods – Two Dimensional Data Distribution
Cyclic Block Distribution
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Programming
Understanding the inter-processor communications of your program is essential
Message Passing communication is programmed explicitly. The programmer must understand and code the communication
Data Parallel compilers and run-time systems do all communications behind the scenes. The programmer need not understand the underlying communications. On the other hand, to get good performance from your code you should write your algorithm so that it communicates as efficiently as possible
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Considerations: Amdahl's Law
It states that potential program speedup is defined by the fraction of code (f) which can be parallelized
If none of the code can be parallelized, f = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, f = 1 and the speedup is infinite (in theory)
$$\text{speedup} = \frac{1}{1 - f}$$
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Considerations: Amdahl's Law
Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by the equation
$$\text{speedup} = \frac{1}{\frac{P}{N} + S}$$
where:
  P: parallel fraction
  N: number of processors
  S: serial fraction
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Considerations: Amdahl's Law
There are clear limits to the scalability of parallelism. For example, at P = 0.50, 0.90 and 0.99 (50%, 90% and 99% of the code is parallelizable) the speedup is:
N        P=0.50    P=0.90    P=0.99
10       1.82      5.26      9.17
100      1.98      9.17      50.25
1000     1.998     9.91      90.99
10000    1.9998    9.991     99.02
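The table can be reproduced directly from the formula speedup = 1 / (P/N + S) with S = 1 - P. A minimal sketch, where the function name `amdahl_speedup` is an illustrative choice:

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup predicted by Amdahl's Law: 1 / (P/N + S), with S = 1 - P."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / n_processors + serial_fraction)

for n in (10, 100, 1000, 10000):
    row = [amdahl_speedup(p, n) for p in (0.50, 0.90, 0.99)]
    print(n, " ".join(f"{s:.4f}" for s in row))
```

As N grows, the speedup approaches 1/S, which is why even 99% parallel code cannot exceed a speedup of 100.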
Considerations: Amdahl's Law
Problems which increase the percentage of parallel time with their size are more "scalable" than problems with a fixed percentage of parallel time
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Considerations: Load Balancing
Load balancing refers to the ways of distributing processes so as to ensure the most time-efficient parallel execution
If processes are not distributed in a balanced way, some processors are still busy while others sit idle
Performance can be increased if work can be more evenly distributed
For example, if there are many processes of varying sizes, it may be more efficient to maintain a process pool and distribute to processors as each finishes
Consider a heterogeneous environment where there are machines of widely varying power and user load versus a homogeneous environment with identical processors running one job per processor
Source: https://computing.llnl.gov/tutorials/parallel_comp/
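The benefit of a process pool can be illustrated with a small simulation. This is a sketch, not a measurement: the task sizes are hypothetical, processors are assumed identical, and communication cost is ignored.

```python
import heapq

def static_makespan(task_times, n_procs):
    """Deal tasks out in fixed contiguous blocks before the run starts."""
    block = -(-len(task_times) // n_procs)  # ceiling division
    loads = [sum(task_times[k:k + block]) for k in range(0, len(task_times), block)]
    return max(loads)

def pool_makespan(task_times, n_procs):
    """Hand the next task in the pool to whichever processor becomes free first."""
    finish = [0.0] * n_procs  # time at which each processor becomes free
    heapq.heapify(finish)
    for t in task_times:
        free_at = heapq.heappop(finish)
        heapq.heappush(finish, free_at + t)
    return max(finish)

tasks = [8, 7, 6, 5, 1, 1, 1, 1]  # hypothetical task sizes
print("static:", static_makespan(tasks, 2))
print("pool  :", pool_makespan(tasks, 2))
```

In this example the static split leaves one processor with almost all of the work, while the pool keeps both processors busy until the end.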
Considerations: Granularity
In order to coordinate between different processors working on the same problem, some form of communication between them is required
The ratio between computation and communication is known as granularity
The most efficient granularity is dependent on the algorithm and the hardware environment in which it runs
In most cases overhead associated with communications and synchronization is high relative to execution speed so it is advantageous to have coarse granularity
Source: https://computing.llnl.gov/tutorials/parallel_comp/, http://www.buyya.com/cluster/v2chap1.pdf
Fine-grain Parallelism
All processes execute a small number of instructions between communication cycles
Facilitates load balancing
Low computation to communication ratio
Implies high communication overhead and less opportunity for performance enhancement
If granularity is too fine it is possible that the overhead required for communications and synchronization between processes takes longer than the computation
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Fine-grain Parallelism
[Figure: computation phases alternating with communication phases]
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Coarse-grain Parallelism
Typified by long computations consisting of large numbers of instructions between communication/synchronization points
High computation to communication ratio
Implies more opportunity for performance increase
Harder to load balance efficiently
Imagine that the computation workload is 10 kg of material:
  Sand = fine-grain
  Cinder blocks = coarse-grain
Which is easier to distribute?
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Coarse-grain Parallelism
[Figure: computation phases alternating with communication phases]
Considerations: Data Dependency
Data dependency exists when there is multiple use of the same storage location
Types of data dependencies
  Flow Dependent: Process 2 uses a variable computed by Process 1. Process 1 must store/send the variable before Process 2 fetches it
  Output Dependent: Process 1 and Process 2 both compute the same variable and Process 2's value must be stored/sent after Process 1's
  Control Dependent: Process 2's execution depends upon a conditional statement in Process 1. Process 1 must complete before a decision can be made about executing Process 2
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Considerations: Data Dependency
How to handle data dependencies?
  Distributed memory: communicate required data at synchronization points
  Shared memory: synchronize read/write operations between processes
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Considerations: Communication Patterns and Bandwidth
For some problems, increasing the number of processors will:
  Decrease the execution time attributable to computation
  But also increase the execution time attributable to communication
Communication patterns also affect the computation to communication ratio
For example, gather-scatter communications between a single processor and N other processors will be impacted more by an increase in latency than N processors communicating only with nearest neighbors
  The processors have to wait until all have reached a certain point
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Considerations: I/O Operation
I/O operations are generally regarded as inhibitors to parallelism
In an environment where all processors see the same file space, write operations can result in files being overwritten
Read operations will be affected by the fileserver's ability to handle multiple read requests at the same time
I/O which must be conducted over the network (non-local) can cause severe bottlenecks
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Considerations: I/O Operation
Some alternatives:
  Reduce overall I/O as much as possible
  Confine I/O to specific serial portions of the job
    For example, process 0 could read an input file and then communicate required data to other processes. Likewise, process 1 could perform the write operation after receiving required data from all other processes
  Create unique filenames for each process's input/output file(s)
  For distributed memory systems with shared file space, perform I/O in local, non-shared file space
    For example, each processor may have /tmp filespace which can be used. This is usually much more efficient than performing I/O over the network to one's home directory
Considerations: Fault Tolerance and Restarting
In parallel programming, it is usually the programmer's responsibility to handle events such as:
  machine failures
  task failures
  checkpoint restarting
Source: https://computing.llnl.gov/tutorials/parallel_comp/, http://www.buyya.com/cluster/v2chap1.pdf
Considerations: Deadlock
Deadlock describes a condition where two or more processes are waiting for an event or communication from one of the other processes.
The simplest example is demonstrated by two processes which are both programmed to read/receive from the other before writing/sending.
Process 1
  X = 1
  Recv (Process 2, Y)
  Send (Process 2, X)
  Z = X + Y
  …
Process 2
  Y = 10
  Recv (Process 1, X)
  Send (Process 1, Y)
  Z = X + Y
  …
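One common way out of this deadlock is to break the symmetry, so that one process sends before it receives. The sketch below assumes Python threads and queues as stand-ins for processes and message passing: `put()` plays the role of Send and `get()` the role of Recv. The mailbox layout is an illustrative choice, not part of the original example.

```python
import queue
import threading

# One mailbox per process; put() plays the role of Send, get() of Recv.
mailbox = {1: queue.Queue(), 2: queue.Queue()}
z = {}

def process_1():
    x = 1
    mailbox[2].put(x)              # Send (Process 2, X) now comes first
    y = mailbox[1].get(timeout=5)  # Recv (Process 2, Y)
    z[1] = x + y

def process_2():
    y = 10
    x = mailbox[2].get(timeout=5)  # Recv (Process 1, X)
    mailbox[1].put(y)              # Send (Process 1, Y)
    z[2] = x + y

threads = [threading.Thread(target=process_1), threading.Thread(target=process_2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(z)
```

If both functions called `get()` before `put()`, as in the slide, each would block waiting for a message the other never sends; here the timeout would eventually raise `queue.Empty` instead of hanging forever.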
Considerations: Debugging
Debugging parallel programs is significantly more of a challenge than debugging serial programs
Start debugging the program as soon as development starts
Use a modular approach to program development
Pay as close attention to communication details as to computation details
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Essentials of Loop Parallelism
Some problems have a loop construct that forms the main computational component of the code. Loops are a main target for parallelizing and vectorizing code, because a program often spends much of its time in loops. When it can be done, parallelizing these sections of code can have dramatic benefits.
A step-wise refinement procedure for developing the parallel algorithms will be employed. An initial solution for each problem will be presented and improved by considering performance issues
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Essentials of Loop Parallelism
Pseudo-code will be used to describe the solutions. The solutions will address the following issues:
  identification of parallelism
  program decomposition
  load balancing (static vs. dynamic)
  task granularity in the case of dynamic load balancing
  communication patterns - overlapping communication and computation
Note the difference in approaches between message passing and data parallel programming. Message passing explicitly parallelizes the loops, whereas data parallel replaces loops by working on entire arrays in parallel
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: PI Calculation (Serial)
Problem is:
  Computationally intensive
  Minimal communication
The value of PI can be calculated in a number of ways, many of which are easily parallelized
Consider the following method of approximating PI
  Inscribe a circle in a square
  Randomly generate points in the square
  Determine the number of points in the square that are also in the circle
  Let r be the number of points in the circle divided by the number of points in the square
  PI ~ 4 r
Note that the more points generated, the better the approximation
Source: https://computing.llnl.gov/tutorials/parallel_comp/
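Example: PI Calculation (Serial)
[Figure: a circle of radius r inscribed in a square of side 2r]
$$A_{square} = (2r)^2 = 4r^2, \qquad A_{circle} = \pi r^2, \qquad \pi = 4\,\frac{A_{circle}}{A_{square}}$$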
Example: PI Calculation (Serial)
Serial pseudo code for this procedure:
  npoints = 10000
  circle_count = 0
  do j = 1,npoints
    generate 2 random numbers between 0 and 1
    xcoordinate = random1
    ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle
    then circle_count = circle_count + 1
  end do
  PI = 4.0*circle_count/npoints
Note that most of the time in running this program would be spent executing the loop
Source: https://computing.llnl.gov/tutorials/parallel_comp/
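The serial pseudo code translates almost line by line into Python. One assumption is made explicit here: since the random numbers lie between 0 and 1, the "inside circle" test is taken against the quarter circle of radius 1 centered at the origin, which has the same area ratio (PI/4) to the unit square. The seed is fixed only to make the sketch reproducible.

```python
import random

def estimate_pi(npoints, seed=0):
    """Monte Carlo estimate of PI from points drawn uniformly in the unit square."""
    rng = random.Random(seed)
    circle_count = 0
    for _ in range(npoints):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:  # inside the quarter circle of radius 1
            circle_count += 1
    return 4.0 * circle_count / npoints

pi_estimate = estimate_pi(100_000)
print(pi_estimate)
```

Almost all of the run time is spent in the loop, which is what makes it the natural target for parallelization.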
Example: PI Calculation (Parallel)
Parallel strategy: break the loop into portions which can be executed by the processors.
For the task of approximating PI:
  each processor executes its portion of the loop a number of times
  each processor can do its work without requiring any information from the other processors (there are no data dependencies). This situation is known as Embarrassingly Parallel
Use the SPMD (Single Program, Multiple Data) model: one process acts as master and collects the results
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: PI Calculation (Parallel)
Message passing pseudo code:
  npoints = 10000
  circle_count = 0
  p = number of processors
  num = npoints/p
  find out if I am master or worker
  do j = 1,num
    generate 2 random numbers between 0 and 1
    xcoordinate = random1; ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle
    then circle_count = circle_count + 1
  end do
  if I am master
    receive from workers their circle_counts
    compute PI (use master and workers calculations)
  else if I am worker
    send to master circle_count
  endif
Source: https://computing.llnl.gov/tutorials/parallel_comp/
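The structure of the message passing solution can be sketched without an MPI installation. This is an approximation under assumptions: a thread pool stands in for the p processes, returning a value stands in for "send to master", and the per-rank seeding scheme is an illustrative choice so that each rank draws a different random stream.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def count_in_circle(args):
    """Worker: run num iterations of the loop and return the local circle_count."""
    rank, num = args
    rng = random.Random(1000 + rank)  # distinct stream per rank (assumed seeding scheme)
    count = 0
    for _ in range(num):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            count += 1
    return count

def parallel_pi(npoints, p):
    num = npoints // p  # each rank's portion of the loop
    with ThreadPoolExecutor(max_workers=p) as pool:
        counts = list(pool.map(count_in_circle, [(rank, num) for rank in range(p)]))
    # Master combines all circle_counts; num * p equals npoints when p divides npoints.
    return 4.0 * sum(counts) / (num * p)

pi_estimate = parallel_pi(100_000, 4)
print(pi_estimate)
```

The only communication is one number per worker at the very end, which is why this problem is called embarrassingly parallel.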
Example: PI Calculation (Parallel)
Data parallel solution:
  The data parallel solution processes entire arrays at the same time
  No looping is used
  Arrays are automatically distributed to processors. All message passing is done behind the scenes. In data parallel, one node, a sort of master, usually holds all scalar values. The SUM function does a reduction and leaves the value in a scalar variable
  A temporary array, COUNTER, with the same size as RANDOM is created for the sum operation
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: PI Calculation (Parallel)
Data parallel pseudo code:
  fill RANDOM with 2 random numbers between 0 and 1
  where (the values of RANDOM are inside the circle)
    COUNTER = 1
  else where
    COUNTER = 0
  end where
  circle_count = sum (COUNTER)
  PI = 4.0*circle_count/npoints
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: Array Elements Calculation (Serial)
This example shows calculations on array elements that require very little communication.
Elements of a 2-dimensional array are calculated.
The calculation of each element is independent of the others, which leads to an embarrassingly parallel situation.
The problem should be computation intensive.
Serial code could be of the form:
  do j = 1,n
    do i = 1,n
      a(i,j) = fcn(i,j)
    end do
  end do
The serial program calculates one element at a time in the specified order
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: Array Elements Calculation (Parallel)
Message Passing
  Arrays are distributed so that each processor owns a portion of an array.
  Independent calculation of array elements ensures no communication amongst processors is needed.
  Distribution scheme is chosen by other criteria, e.g. unit stride through arrays.
  It is desirable to have unit stride through arrays, so the choice of a distribution scheme depends on the programming language.
    Fortran: block cyclic distribution
    C: cyclic block distribution
After the array is distributed, each processor executes the portion of the loop corresponding to the data it owns.
Notice only the loop variables are different from the serial solution
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: Array Elements Calculation (Parallel)
For example, with Fortran and a block cyclic distribution:
  do j = mystart, myend
    do i = 1,n
      a(i,j) = fcn(i,j)
    end do
  end do
Message Passing Solution:
  With Fortran storage scheme, perform block cyclic distribution of array.
  Implement as SPMD model.
  Master process initializes array, sends info to worker processes and receives results.
  Worker process receives info, performs its share of computation and sends results to master.
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: Array Elements Calculation (Parallel)
Message Passing Pseudo code:
  find out if I am master or worker
  if I am master
    initialize the array
    send each worker info on part of array it owns
    send each worker its portion of initial array
    receive from each worker results
  else if I am worker
    receive from master info on part of array I own
    receive from master my portion of initial array
    # calculate my portion of array
    do j = my first column,my last column
      do i = 1,n
        a(i,j) = fcn(i,j)
      end do
    end do
    send master results
  endif
Source: https://computing.llnl.gov/tutorials/parallel_comp/
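The column decomposition in the pseudo code can be checked with a small sequential simulation. This is a sketch under assumptions: `fcn` is a hypothetical element function, the workers are called one after another instead of running as separate processes, indices are 0-based, and each worker is given one contiguous range of columns.

```python
def fcn(i, j):
    """Hypothetical element function; any pure function of (i, j) would do."""
    return i * i + 2 * j

def column_range(rank, p, n):
    """Contiguous block of columns [start, end) owned by rank out of p workers."""
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end

def worker(rank, p, n):
    """Compute my columns; column j holds a(i, j) for i = 0..n-1."""
    start, end = column_range(rank, p, n)
    return start, [[fcn(i, j) for i in range(n)] for j in range(start, end)]

def master(p, n):
    columns = [None] * n
    for rank in range(p):  # stands in for "receive from each worker results"
        start, block = worker(rank, p, n)
        columns[start:start + len(block)] = block
    return columns

n, p = 7, 3
a = master(p, n)
serial = [[fcn(i, j) for i in range(n)] for j in range(n)]
print(a == serial)
```

Only the loop bounds differ between a worker and the serial code, which matches the observation that only the loop variables change.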
Example: Array Elements Calculation (Parallel)
Data Parallel
  A trivial problem for a data parallel language.
  Data parallel languages often have compiler directives to do data distribution.
  Loops are replaced by a "for all elements" construct which performs the operation in parallel.
  Good example of ease in programming versus message passing.
Pseudo code solution:
  DISTRIBUTE a (block, cyclic)
  for all elements (i,j)
    a(i,j) = fcn (i,j)
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: Array Elements Calculation (Dynamic Load Balancing)
We've looked at problems that are statically load balanced:
  each processor has a fixed amount of work to do
  there may be significant idle time for faster or more lightly loaded processors.
This is usually not a major concern with dedicated usage (i.e. load leveler).
If you have a load balance problem, you can use a "dynamic load balancing" scheme.
This solution is only available in message passing.
Two processes are employed:
  Master Process:
    holds pool of tasks for worker processes to do
    sends worker a task when requested
    collects results from workers
  Worker Process: repeatedly does the following
    gets task from master process
    performs computation
    sends results to master
Worker processes do not know before runtime which portion of the array they will handle or how many tasks they will perform.
The fastest process will get more tasks to do.
Example: Array Elements Calculation (Dynamic Load Balancing)
Solution:
  Calculate an array element
  Worker process gets task from master, performs work, sends results to master, and gets next task
Pseudo code solution:
  find out if I am master or worker
  if I am master
    do until no more jobs
      send to worker next job
      receive results from worker
    end do
    tell workers no more jobs
  else if I am worker
    do until no more jobs
      receive from master next job
      calculate array element: a(i,j) = fcn(i,j)
      send results to master
    end do
  endif
Source: https://computing.llnl.gov/tutorials/parallel_comp/
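The master/worker task pool can be sketched with Python threads and queues. The assumptions: threads stand in for processes, a shared job queue stands in for "send to worker next job", a `None` entry is used as the "no more jobs" signal, and `fcn` is a hypothetical element function.

```python
import queue
import threading

def fcn(i, j):
    return i * 10 + j  # hypothetical element function

def run_pool(n, n_workers):
    jobs = queue.Queue()
    results = queue.Queue()
    for i in range(n):
        for j in range(n):
            jobs.put((i, j))  # master fills the pool of tasks
    for _ in range(n_workers):
        jobs.put(None)  # "no more jobs" signal, one per worker

    def worker():
        while True:
            job = jobs.get()  # receive from master next job
            if job is None:
                break
            i, j = job
            results.put((i, j, fcn(i, j)))  # send results to master

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    a = [[None] * n for _ in range(n)]
    while not results.empty():  # master collects results
        i, j, value = results.get()
        a[i][j] = value
    return a

a = run_pool(4, 3)
print(a)
```

No worker knows in advance which elements it will compute; a faster worker simply comes back to the queue more often.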
Example: Array Elements Calculation (Dynamic Load Balancing)
Static load balancing can result in significant idle time for faster processors.
Dynamic load balancing offers a potential solution - the faster processors do more work.
In the dynamic load balancing solution, the workers calculated array elements, resulting in:
  optimal load balancing: all processors complete work at the same time
  fine granularity: small unit of computation, master and worker communicate after every element
  fine granularity may cause very high communications cost
Alternate Parallel Solution:
  give processors more work - columns or rows rather than elements
  more computation and less communication results in larger granularity
  reduced communication may improve performance
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: Simple Heat Equation (Serial)
Most problems in parallel computing require communication among the processors.
A common problem requires communication with a "neighbor" processor.
The heat equation describes the temperature change over time, given an initial temperature distribution and boundary conditions.
A finite differencing scheme is employed to solve the heat equation numerically on a square region.
The initial temperature is zero on the boundaries and high in the middle.
The boundary temperature is held at zero.
For the fully explicit problem, a time stepping algorithm is used. The elements of a 2-dimensional array represent the temperature at points on the square
Source: https://computing.llnl.gov/tutorials/parallel_comp/
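Example: Simple Heat Equation (Serial)
[Figure: five-point stencil. The update of U(x, y) uses its neighbors U(x-1, y), U(x+1, y), U(x, y-1) and U(x, y+1)]
$$U'_{x,y} = U_{x,y} + C_x \left(U_{x+1,y} + U_{x-1,y} - 2U_{x,y}\right) + C_y \left(U_{x,y+1} + U_{x,y-1} - 2U_{x,y}\right)$$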
Source: https://computing.llnl.gov/tutorials/parallel_comp/, http://www.mcs.anl.gov/~itf/dbpp/text/node9.html
Example: Simple Heat Equation (Serial)
The calculation of an element is dependent on neighbor element values.
A serial program would contain code like
  do iy = 2, ny - 1
    do ix = 2, nx - 1
      u2(ix, iy) = u1(ix, iy)
        + cx * (u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy))
        + cy * (u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
    end do
  end do
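The serial update can be written as a small Python function. This is a sketch: nested lists stand in for the Fortran arrays, indices are 0-based, and the grid size and the coefficients cx = cy = 0.1 are illustrative values, not taken from the original.

```python
def heat_step(u1, cx, cy):
    """One explicit time step; boundary values are carried over unchanged."""
    nx, ny = len(u1), len(u1[0])
    u2 = [row[:] for row in u1]
    for ix in range(1, nx - 1):
        for iy in range(1, ny - 1):
            u2[ix][iy] = (u1[ix][iy]
                          + cx * (u1[ix + 1][iy] + u1[ix - 1][iy] - 2.0 * u1[ix][iy])
                          + cy * (u1[ix][iy + 1] + u1[ix][iy - 1] - 2.0 * u1[ix][iy]))
    return u2

# Zero on the boundaries, high in the middle.
u = [[0.0] * 5 for _ in range(5)]
u[2][2] = 100.0
u_next = heat_step(u, 0.1, 0.1)
for row in u_next:
    print(row)
```

Each new value reads only from the old array `u1`, so all interior points of one time step could be updated in any order.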
Example: Simple Heat Equation (Parallel)
Arrays are distributed so that each processor owns a portion of the arrays.
Determine data dependencies
  interior elements belonging to a processor are independent of other processors
  border elements are dependent upon a neighbor processor's data, so communication is required.
Message Passing
First Parallel Solution:
  Fortran storage scheme, block cyclic distribution
  Implement as SPMD model
  Master process sends initial info to workers, checks for convergence and collects results
  Worker process calculates solution, communicating as necessary with neighbor processes
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: Simple Heat Equation (Parallel)
[Figure: interior elements and border elements of each processor's portion of the array]
Example: Simple Heat Equation (Parallel)
First Pseudo code solution:
  find out if I am master or worker
  if I am master
    initialize array
    send each worker starting info
    do until all workers have converged
      gather from all workers convergence data
      broadcast to all workers convergence signal
    end do
    receive results from each worker
  else if I am worker
    receive from master starting info
    do until all workers have converged
      update time
      send neighbors my border info
      receive from neighbors their border info
      update my portion of solution array
      determine if my solution has converged
      send master convergence data
      receive from master convergence signal
    end do
    send master results
  endif
Source: https://computing.llnl.gov/tutorials/parallel_comp/
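The border exchange can be checked with a sequential simulation. This is a sketch under several assumptions: the array is split into strips of rows (the first index), each strip carries one extra "ghost" row on each side for its neighbor's border, the workers are run one after another, a fixed number of time steps replaces the convergence test, and there are at least as many interior rows as workers.

```python
def heat_step(u1, cx, cy):
    """Update all interior points of a (sub)array; its outer rows and columns stay fixed."""
    nx, ny = len(u1), len(u1[0])
    u2 = [row[:] for row in u1]
    for ix in range(1, nx - 1):
        for iy in range(1, ny - 1):
            u2[ix][iy] = (u1[ix][iy]
                          + cx * (u1[ix + 1][iy] + u1[ix - 1][iy] - 2.0 * u1[ix][iy])
                          + cy * (u1[ix][iy + 1] + u1[ix][iy - 1] - 2.0 * u1[ix][iy]))
    return u2

def split_rows(u, p):
    """Master: give each worker a strip of interior rows plus two ghost rows."""
    base, extra = divmod(len(u) - 2, p)
    strips, start = [], 1
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        strips.append([row[:] for row in u[start - 1:start + size + 1]])
        start += size
    return strips

def exchange_borders(strips):
    """Workers: copy each neighbor's border row into the local ghost row."""
    for rank in range(len(strips) - 1):
        strips[rank][-1] = strips[rank + 1][1][:]   # lower ghost <- neighbor's first row
        strips[rank + 1][0] = strips[rank][-2][:]   # upper ghost <- neighbor's last row

def parallel_heat(u, cx, cy, steps, p):
    strips = split_rows(u, p)
    for _ in range(steps):
        exchange_borders(strips)
        strips = [heat_step(s, cx, cy) for s in strips]  # each worker updates its strip
    result = [u[0][:]]
    for s in strips:
        result.extend(s[1:-1])  # master gathers the rows each worker owns
    result.append(u[-1][:])
    return result

u0 = [[0.0] * 6 for _ in range(8)]
u0[3][2], u0[4][3] = 100.0, 50.0
reference = u0
for _ in range(5):
    reference = heat_step(reference, 0.1, 0.1)
decomposed = parallel_heat(u0, 0.1, 0.1, 5, 3)
print(decomposed == reference)
```

Only the border rows cross strip boundaries in each step; the interior rows of a strip never need data from another worker.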
Example: Simple Heat Equation (Parallel)
Data Parallel
  Loops are not used. The entire array is processed in parallel.
  The distribute statements lay out the data in parallel.
  A SHIFT is used to increment or decrement an array index.
  DISTRIBUTE u1 (block,cyclic)
  DISTRIBUTE u2 (block,cyclic)
  u2 = u1 + cx * (SHIFT (u1,1,dim 1) + SHIFT (u1,-1,dim 1) - 2.*u1)
          + cy * (SHIFT (u1,1,dim 2) + SHIFT (u1,-1,dim 2) - 2.*u1)
Source: https://computing.llnl.gov/tutorials/parallel_comp/
Example: Simple Heat Equation (Overlapping Communication and Computation)
Previous examples used blocking communications, which wait for the communication to complete before the process continues.
Computing times can often be reduced by using non-blocking communication.
Work can be performed while communication is in progress.
In the heat equation problem, neighbor processes communicated border data, then each process updated its portion of the array.
Each process could update the interior of its part of the solution array while the communication of border data is occurring, and update its border after communication has completed.
Source: https://computing.llnl.gov/tutorials/parallel_comp/, http://www.mcs.anl.gov/~itf/dbpp/text/node9.html
Example: Simple Heat Equation (Overlapping Communication and Computation)
Second Pseudo code:
  find out if I am master or worker
  if I am master
    initialize array
    send each worker starting info
    do until solution converged
      gather from all workers convergence data
      broadcast to all workers convergence signal
    end do
    receive results from each worker
  else if I am worker
    receive from master starting info
    do until solution converged
      update time
      non-blocking send neighbors my border info
      non-blocking receive neighbors border info
      update interior of my portion of solution array
      wait for non-blocking communication complete
      update border of my portion of solution array
      determine if my solution has converged
      send master convergence data
      receive from master convergence signal
    end do
    send master results
  endif
Source: https://computing.llnl.gov/tutorials/parallel_comp/, http://www.buyya.com/cluster/v2chap1.pdf