
Programming with MPI – A Detailed Example

By: Llewellyn Yap, Wednesday, 2-23-2000

EE524 • LECTURE

How to Build a Beowulf, Chapter 9


Overview of MPI

• MPI – a distributed-memory, message-passing model

• Decomposition can be static or dynamic
  – Static: fixed once and for all
  – Dynamic: changing in response to the simulation

• When partitioning, consider minimizing communication


Implementing High-Performance Parallel Applications with MPI

• Choose algorithm with sufficient parallelism

• Optimize a sequential version of the algorithm

• Use the simplest possible MPI operations
  – usually blocking, standard-mode procedures (see the sketch after this list)

• Profiling and analysis
  – find which operations take the most time

• Attack the most time-consuming components
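
To make "simplest possible MPI operations" concrete, here is a minimal sketch (not from the slides) of blocking, standard-mode point-to-point communication; the payload, tag, and rank choices are arbitrary:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        int data[4] = {1, 2, 3, 4};            /* arbitrary payload */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            if (rank == 0) {
                /* blocking, standard-mode send: returns once the buffer is reusable */
                MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Status status;
                /* blocking receive: returns once the message has arrived */
                MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
                printf("rank 1 received %d %d %d %d\n",
                       data[0], data[1], data[2], data[3]);
            }
        }

        MPI_Finalize();
        return 0;
    }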


Steps to Good Parallel Implementation

• Understand the strengths and weaknesses of sequential solutions
• Choose a good sequential algorithm
• Design a strategy for parallelization
• Develop a rough semi-analytic model of how the parallel algorithm should perform


Steps to Good Parallel Implementation (cont’d)

• Implement, using MPI for interprocessor communication
• Carry out measurements to verify performance
• Identify bottlenecks, sources of overhead, etc.; minimize their impact
• Iterate to improve performance where possible


Investigate Sorting

• Sorting is a multi-faceted, irregular, non-grid-based problem

• Used widely in database servers

• Issues of sorting (size of elements, form of result, storage, etc.)

• Two approaches will be discussed:
  1. A fairly restricted domain
  2. A more general approach


Sorting (Simple Approach): Assumptions

• Elements are positive integers

• Secondary storage (disk, tape) not used

• Auxiliary primary storage is available

• Input data uniformly distributed over range of integers


Sorting (Simple Approach): The Approach

Most problems arising in parallel computation have already been solved for traditional sequential architectures.

Sequential solutions exist as libraries, system calls, or language constructs; these will be used as building blocks for a parallel solution.


Sorting (Simple Approach): The Approach (cont’d)

The advantage of this approach is that it leverages the design, debugging, and optimization that have been performed for the sequential case.

Assume that we have at our disposal an optimized, debugged, sequential function ‘isort’ that sorts arrays of integers in the memory of a single processor.


Sorting (Simple Approach): The Algorithm

• Partially pre-sort the elements so that all elements in processor p are less than all those in higher-numbered processors
  – Recall that on Beowulf systems, the high latency of network communication favors transmission of large messages over small ones
  – Determine the range of values assigned to each processor
    • Processor p gets values between p*(INT_MAX/commsize) and (p+1)*(INT_MAX/commsize)-1, inclusive


Sorting (Simple Approach): The Algorithm (cont’d)

– Each processor scans its own list and labels each element with its destination processor
– Elements are placed in a buffer specific to that destination processor (see the sketch below)

• Communicate data between processors
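
A minimal sketch of the labeling-and-packing step described above; the function name and two-pass layout are illustrative, not the book's sort1 source:

    #include <limits.h>
    #include <stdlib.h>

    /* Label each key with its destination processor and pack per-destination
     * buffers.  'keys' holds n positive ints; 'commsize' is the number of
     * ranks.  On return, counts[p] = number of keys bound for rank p, and
     * 'packed' holds the keys grouped by destination.  Hypothetical helper. */
    void pack_by_destination(const int *keys, int n, int commsize,
                             int *counts, int *packed)
    {
        int range = INT_MAX / commsize;     /* width of each rank's key range */
        int *offsets = malloc(commsize * sizeof(int));
        int p, i;

        for (p = 0; p < commsize; p++)
            counts[p] = 0;
        for (i = 0; i < n; i++) {           /* first pass: count per destination */
            int dest = keys[i] / range;
            if (dest >= commsize) dest = commsize - 1;   /* clamp the top edge */
            counts[dest]++;
        }
        offsets[0] = 0;                     /* prefix sums give buffer offsets */
        for (p = 1; p < commsize; p++)
            offsets[p] = offsets[p-1] + counts[p-1];
        for (i = 0; i < n; i++) {           /* second pass: scatter into buffers */
            int dest = keys[i] / range;
            if (dest >= commsize) dest = commsize - 1;
            packed[offsets[dest]++] = keys[i];
        }
        free(offsets);
    }

The two-pass structure (count, then scatter) lets all per-destination buffers live in one contiguous allocation, which is exactly what the collective call on the next slide wants.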


Sorting (Simple Approach): The Algorithm (cont’d)

• MPI provides communication tools
  – MPI_Alltoallv
  – requires each processor to specify how much data is incoming from every partner, and exactly where it should go
• First distribute the lengths with MPI_Alltoall (see the sketch below)
• Allocate contiguous space for all outgoing elements (temporarily)
• Pacreate initializes the returned structure
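
A sketch of the length and data exchange, assuming integer keys already packed by destination as above; the function name is hypothetical, error handling is omitted, and the book's Pacreate structure is not shown:

    #include <mpi.h>
    #include <stdlib.h>

    /* Exchange per-destination counts, then the keys themselves.
     * sendbuf/sendcounts come from the packing step; recvbuf is
     * allocated here and returned (caller frees). */
    int *exchange_keys(const int *sendbuf, const int *sendcounts,
                       int commsize, MPI_Comm comm, int *nrecv_out)
    {
        int *recvcounts = malloc(commsize * sizeof(int));
        int *sdispls    = malloc(commsize * sizeof(int));
        int *rdispls    = malloc(commsize * sizeof(int));
        int *recvbuf;
        int p, nrecv;

        /* Each rank learns how many keys are incoming from every partner. */
        MPI_Alltoall((void *)sendcounts, 1, MPI_INT,
                     recvcounts, 1, MPI_INT, comm);

        sdispls[0] = rdispls[0] = 0;
        for (p = 1; p < commsize; p++) {
            sdispls[p] = sdispls[p-1] + sendcounts[p-1];
            rdispls[p] = rdispls[p-1] + recvcounts[p-1];
        }
        nrecv = rdispls[commsize-1] + recvcounts[commsize-1];
        recvbuf = malloc(nrecv * sizeof(int));

        /* Deliver every key to its destination rank in one collective call. */
        MPI_Alltoallv((void *)sendbuf, (int *)sendcounts, sdispls, MPI_INT,
                      recvbuf, recvcounts, rdispls, MPI_INT, comm);

        free(recvcounts); free(sdispls); free(rdispls);
        *nrecv_out = nrecv;
        return recvbuf;
    }

The MPI_Alltoall of the counts is exactly the "first distribute the lengths" step on the slide; its result is what lets each rank size its receive buffer before MPI_Alltoallv runs.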


Analysis of Integer Sort

The quality of a parallel implementation is assessed by measuring:

• speedup s(P) = T(1)/T(P)

• efficiency ε(P) = T(1)/(P×T(P)) = s(P)/P

• overhead γ(P) = (P×T(P) − T(1))/T(1) = (1 − ε)/ε
  – where T(1) is the time of the best available implementation on a single processor
  – e.g., T(1) = 100 s and T(4) = 30 s give s(4) ≈ 3.3, ε(4) ≈ 0.83, γ(4) = 0.2

• Overhead is useful because it is additive


Sources of Overhead

• Communication

• Redundancy

• Extra Work

• Load Imbalance

• Waiting


Sources of Overhead: Communication

• Time spent communicating in the parallel code (absent from a sequential implementation)

• Easy to estimate; the largest contribution to overhead in the sort example
  – examine the MPI_Alltoall and MPI_Alltoallv calls

• For MPICH, these calls are implemented as loops over point-to-point calls, hence (evaluated in the sketch below)

Tcomm = 2 × P × t_latency + sizeof(local arrays)/bandwidth
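
The model is easy to evaluate numerically as a back-of-the-envelope check; the latency and bandwidth figures below are placeholder values, not measurements from the book:

    #include <stdio.h>

    /* Estimate collective-communication time from the simple model above.
     * t_latency in seconds, bandwidth in bytes/second; placeholder values. */
    double t_comm(int P, double t_latency, double local_bytes, double bandwidth)
    {
        return 2.0 * P * t_latency + local_bytes / bandwidth;
    }

    int main(void)
    {
        /* e.g. 16 ranks, 100 us latency, 4 MB local data, 10 MB/s Ethernet */
        printf("estimated Tcomm = %.3f s\n",
               t_comm(16, 100e-6, 4e6, 10e6));
        return 0;
    }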


Sources of Overhead: Redundancy

• The same computation is performed on many processors

• P−1 processors are not carrying out useful work

• Negligible for sort1
• Some O(1) operations (e.g., calling malloc to obtain temporary space) do not impact performance


Sources of Overhead: Extra Work

• Parallel computation that does not take place in a sequential implementation
  – e.g., for sort1: computing the destination processor for every input element


Sources of Overhead: Load Imbalance

• Measures extra time spent by the slowest processor, in excess of the mean over all processors

• With uniformly distributed input, the per-processor count is approximately Gaussian, n ~ N(N/P, (N/P)×sqrt((P−1)/N)), so the load imbalance satisfies

imbal = (n_largest − n_mean)/n_mean ≈ O(1)×sqrt((P−1)/N)

  – e.g., N = 10^6 and P = 16 give an imbalance of roughly 0.4%


Sources of Overhead: Waiting

• Fine-grained imbalance, even though the overall load may be balanced
  – e.g., frequent synchronization between short computations

• For sort, synchronization occurs during the calls to MPI_Alltoall and MPI_Alltoallv
  – these occur immediately after the initial decomposition; the overhead is negligible


Measurement of Integer Sort

• upshot (MPICH)
  – a graphical tool that renders a visual representation of parallel program behavior
  – logs the time spent in different phases
  – the goal is to improve performance


More General Sorting: Performance Improvement

• Faster sequential sort routines
  – D.E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, 1973

• Relax restrictions on the input data
  – sort more general objects

• May no longer use an integer key
  – use of a compar function (see the sketch below)
  – choosing “fenceposts”
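
A minimal sketch of a qsort-style compar function for general objects; the 32-byte record layout is invented for illustration (it happens to match the object size used in the analysis later):

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical 32-byte record: an opaque key followed by payload. */
    typedef struct {
        char key[8];        /* sort key, compared lexicographically */
        char payload[24];
    } record_t;

    /* qsort/bsearch-compatible comparison function ("compar"). */
    static int compar(const void *a, const void *b)
    {
        const record_t *ra = a, *rb = b;
        return memcmp(ra->key, rb->key, sizeof ra->key);
    }

    /* Usage: qsort(records, n, sizeof(record_t), compar); */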


More General Sorting: Approach

• Solution to choosing fenceposts: oversample

• Select nfence objects
  – a larger value of nfence gives better load balance
  – but results in more work
  – the best nfence value is difficult to determine a priori

• Use MPI_Allgather and bsearch to divide the data into more bins than there are processors (see the sketch below)
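
A sketch of the oversampling step, assuming integer keys; NFENCE and the sampling scheme are illustrative choices, and the bin lookup is written as an explicit lower-bound binary search where the book's code uses bsearch:

    #include <mpi.h>
    #include <stdlib.h>

    #define NFENCE 8   /* samples contributed per rank; illustrative value */

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Bin index of 'key': the number of fenceposts <= key.
     * Usage: bin = find_bin(keys[i], fence, nf) for each local key. */
    static int find_bin(int key, const int *fence, int nf)
    {
        int lo = 0, hi = nf;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (fence[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /* Every rank contributes NFENCE samples of its n local keys (n >= 1);
     * MPI_Allgather makes the full, sorted fencepost set visible everywhere. */
    void choose_fenceposts(const int *keys, int n, MPI_Comm comm,
                           int *fence /* size NFENCE*commsize */, int *nf_out)
    {
        int commsize, i;
        int local[NFENCE];

        MPI_Comm_size(comm, &commsize);
        for (i = 0; i < NFENCE; i++)        /* evenly spaced local samples */
            local[i] = keys[(long)i * n / NFENCE];

        MPI_Allgather(local, NFENCE, MPI_INT, fence, NFENCE, MPI_INT, comm);
        qsort(fence, NFENCE * commsize, sizeof(int), cmp_int);
        *nf_out = NFENCE * commsize;
    }

With nf fenceposts there are nf + 1 bins, i.e. NFENCE bins per processor, which is what gives the assignment step room to balance the load.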


More General Sorting: Approach (cont’d)

• Look at the population of each bin and assign bins to processors, achieving load balance

• MPI_Allreduce computes the sum over all processors of the count in each bucket (see the sketch below)

• Finally, MPI_Alltoallv delivers the elements to their correct destinations, and a call to qsort completes the local sort in each processor
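
A sketch of the bin-accounting step with invented names; local_counts would come from the binning step above:

    #include <mpi.h>
    #include <stdlib.h>

    /* Sum each bin's local count across all ranks so every processor sees
     * the global population of every bin, then greedily assign contiguous
     * bins to ranks so each gets roughly total/P elements. */
    void balance_bins(const int *local_counts, int nbins, int commsize,
                      MPI_Comm comm, int *owner /* size nbins */)
    {
        int *global = malloc(nbins * sizeof(int));
        long total = 0, target, acc = 0;
        int b, rank = 0;

        /* Global population of each bin. */
        MPI_Allreduce((void *)local_counts, global, nbins,
                      MPI_INT, MPI_SUM, comm);

        for (b = 0; b < nbins; b++)
            total += global[b];
        target = (total + commsize - 1) / commsize;   /* elements per rank */

        for (b = 0; b < nbins; b++) {      /* greedy contiguous assignment */
            owner[b] = rank;
            acc += global[b];
            if (acc >= target && rank < commsize - 1) {
                rank++;
                acc = 0;
            }
        }
        free(global);
    }

The owner[] map then drives the same count-exchange and MPI_Alltoallv pattern sketched earlier, after which each rank qsorts its received elements.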


Analysis of General Sort

• More complicated program, as it invokes more MPI routines

• Tradeoff between the cost of selecting more fenceposts and the improved load balance from using more samples

• The author chooses an intermediate case: P = 12, N = 1M, object size = 32 bytes


Summary

• Trust no one

• A performance model

• Instrumentation and graphs

• Graphical tools

• Superlinear speedup

• Enough is enough