piyush-bagaria
8/6/2019 01 08 Parallel Computing Explained
http://slidepdf.com/reader/full/01-08-parallel-computing-explained 1/21
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
  8.1 Speedup
  8.2 Speedup Extremes
  8.3 Efficiency
  8.4 Amdahl's Law
  8.5 Speedup Limitations
  8.6 Benchmarks
  8.7 Summary
9 About the IBM Regatta P690
Parallel Performance Analysis
• Now that you have parallelized your code and have run it on a parallel computer using multiple processors, you may want to know the performance gain that parallelization has achieved.
• This chapter describes how to compute parallel code performance.
• Often the performance gain is not perfect, and this chapter also explains some of the reasons for limitations on parallel performance.
• Finally, this chapter covers the kinds of information you should provide in a benchmark, and some sample benchmarks are given.
Speedup
• The speedup of your code tells you how much performance gain is achieved by running your program in parallel on multiple processors.
  • A simple definition is that it is the length of time it takes a program to run on a single processor, divided by the time it takes to run on multiple processors.
  • Speedup generally ranges between 0 and p, where p is the number of processors.
• Scalability
  • When you compute with multiple processors in a parallel environment, you will also want to know how your code scales.
  • The scalability of a parallel code is defined as its ability to achieve performance proportional to the number of processors used.
  • As you run your code with more and more processors, you want to see the performance of the code continue to improve.
  • Computing speedup is a good way to measure how a program scales as more processors are used.
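As a minimal sketch of the speedup definition, the following Python snippet divides the one-processor run time by the run time on p processors. The function name `speedup` and the timing values are hypothetical, chosen only for illustration; substitute your own measured wall-clock times.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S_p = T_1 / T_p: one-processor time divided by p-processor time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


# Hypothetical wall-clock times in seconds, keyed by processor count.
timings = {1: 120.0, 2: 62.0, 4: 33.0, 8: 20.0}

# Speedup for each processor count, relative to the one-processor run.
speedups = {p: speedup(timings[1], t) for p, t in timings.items()}

for p, s in speedups.items():
    print(f"p = {p}: speedup = {s:.2f}")
```

If the speedup values keep growing as p grows, the code is scaling; if they flatten out, it is not.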
Speedup
• Linear Speedup
  • If it takes one processor an amount of time t to do a task and if p processors can do the task in time t / p, then you have perfect or linear speedup (S_p = p).
    • That is, running with 4 processors improves the time by a factor of 4, running with 8 processors improves the time by a factor of 8, and so on.
    • This is shown in the following illustration.
Speedup Extremes
• The extremes of speedup happen when speedup is
  • greater than p, called super-linear speedup,
  • less than 1.
• Super-Linear Speedup
  • You might wonder how super-linear speedup can occur. How can speedup be greater than the number of processors used?
    • The answer usually lies with the program's memory use. When using multiple processors, each processor only gets part of the problem compared to the single-processor case. It is possible that the smaller problem can make better use of the memory hierarchy, that is, the cache and the registers. For example, the smaller problem may fit in cache when the entire problem would not.
    • When super-linear speedup is achieved, it is often an indication that the sequential code, run on one processor, had serious cache miss problems.
  • The most common programs that achieve super-linear speedup are those that solve dense linear algebra problems.
Speedup Extremes
• Parallel Code Slower than Sequential Code
  • When speedup is less than one, it means that the parallel code runs slower than the sequential code.
  • This happens when there isn't enough computation to be done by each processor.
  • The overhead of creating and controlling the parallel threads outweighs the benefits of parallel computation, and it causes the code to run slower.
  • To eliminate this problem you can try to increase the problem size or run with fewer processors.
Efficiency
• Efficiency is a measure of parallel performance that is closely related to speedup and is often also presented in a description of the performance of a parallel program.
• Efficiency with p processors is defined as the ratio of speedup with p processors to p, that is, E_p = S_p / p.
• Efficiency is a fraction that usually ranges between 0 and 1.
  • E_p = 1 corresponds to perfect speedup of S_p = p.
• You can think of efficiency as describing the average speedup per processor.
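The definition of efficiency as speedup divided by p can be sketched in a few lines of Python. The function name `efficiency` and the run times below are hypothetical values for illustration, not measurements.

```python
def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """Efficiency E_p = S_p / p, where S_p = T_1 / T_p."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return (t_serial / t_parallel) / p


# Hypothetical run times: 100 s on 1 processor, 25 s on 4, 20 s on 8.
e4 = efficiency(100.0, 25.0, 4)  # speedup 4 on 4 processors
e8 = efficiency(100.0, 20.0, 8)  # speedup 5 on 8 processors

print(f"E_4 = {e4:.3f}, E_8 = {e8:.3f}")
```

In this made-up case the 4-processor run has perfect efficiency, while the 8-processor run gets a higher speedup but uses each processor less effectively.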
Amdahl's Law
• An alternative formula for speedup is Amdahl's Law, attributed to Gene Amdahl, one of America's great computer scientists.
  • This formula, introduced in 1967, states that no matter how many processors are used in a parallel run, a program's speedup will be limited by its fraction of sequential code.
    • That is, almost every program has a fraction of the code that doesn't lend itself to parallelism.
    • This is the fraction of code that will have to be run with just one processor, even in a parallel run.
• Amdahl's Law defines speedup with p processors as follows:
$$S_p = \frac{1}{f + \frac{1 - f}{p}}$$
  • where the term f stands for the fraction of operations done sequentially with just one processor, and the term (1 - f) stands for the fraction of operations done in perfect parallelism with p processors.
Amdahl's Law
• The sequential fraction of code, f, is a unitless measure ranging between 0 and 1.
  • When f is 0, meaning there is no sequential code, then speedup is p, or perfect parallelism. This can be seen by substituting f = 0 in the formula above, which results in S_p = p.
  • When f is 1, meaning there is no parallel code, then speedup is 1, or there is no benefit from parallelism. This can be seen by substituting f = 1 in the formula above, which results in S_p = 1.
• This shows that Amdahl's speedup ranges between 1 and p, where p is the number of processors used in a parallel processing run.
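The two substitutions can be checked numerically. This is a minimal sketch that assumes the usual form of Amdahl's Law, S_p = 1 / (f + (1 - f) / p); the function name `amdahl_speedup` and the choice of p = 16 are illustrative assumptions.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Amdahl's Law, assumed form: S_p = 1 / (f + (1 - f) / p).

    f is the sequential fraction (0 <= f <= 1), p the number of processors.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (f + (1.0 - f) / p)


p = 16  # illustrative processor count

no_sequential = amdahl_speedup(0.0, p)   # f = 0: speedup equals p
all_sequential = amdahl_speedup(1.0, p)  # f = 1: speedup equals 1
ten_percent = amdahl_speedup(0.1, p)     # in between: strictly between 1 and p

print(no_sequential, all_sequential, round(ten_percent, 2))
```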
Amdahl's Law
• The interpretation of Amdahl's Law is that speedup is limited by the fact that not all parts of a code can be run in parallel.
  • Substituting in the formula, when the number of processors goes to infinity, your code's speedup is still limited by 1 / f.
• Amdahl's Law shows that the sequential fraction of code has a strong effect on speedup.
  • This helps to explain the need for large problem sizes when using parallel computers.
  • It is well known in the parallel computing community that you cannot take a small application and expect it to show good performance on a parallel computer.
  • To get good performance, you need to run large applications, with large data array sizes and lots of computation.
  • The reason for this is that as the problem size increases, the opportunity for parallelism grows and the sequential fraction shrinks, so it matters less for speedup.
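Both points can be illustrated with a short sketch, again assuming Amdahl's Law in the form S_p = 1 / (f + (1 - f) / p). The workload model in `sequential_fraction` (a fixed amount of sequential work plus parallel work proportional to the problem size n) and all of its constants are hypothetical, picked only to show the trend.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Amdahl's Law, assumed form: S_p = 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)


# 1. With f fixed, speedup approaches but never exceeds 1 / f.
f = 0.05
limit = 1.0 / f
saturation = {p: amdahl_speedup(f, p) for p in (10, 100, 1000, 10**6)}


# 2. Hypothetical workload model: sequential work is constant, parallel
#    work grows with problem size n, so the sequential fraction shrinks.
def sequential_fraction(n: int, seq_ops: float = 1000.0,
                        par_ops_per_item: float = 10.0) -> float:
    return seq_ops / (seq_ops + par_ops_per_item * n)


p = 64
small = amdahl_speedup(sequential_fraction(100), p)      # f = 0.5
large = amdahl_speedup(sequential_fraction(10_000), p)   # f is about 0.01

print(f"limit 1/f = {limit:.0f}, S at p = 10^6: {saturation[10**6]:.3f}")
print(f"p = {p}: small problem S = {small:.2f}, large problem S = {large:.2f}")
```

Under this model the same 64 processors give a speedup below 2 on the small problem and well above 30 on the large one.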
Agenda
8 Parallel Performance Analysis
  8.1 Speedup
  8.2 Speedup Extremes
  8.3 Efficiency
  8.4 Amdahl's Law
  8.5 Speedup Limitations
    8.5.1 Memory Contention Limitation
    8.5.2 Problem Size Limitation
  8.6 Benchmarks
  8.7 Summary
Speedup Limitations
• This section covers some of the reasons why a program doesn't get perfect speedup. Some of the reasons for limitations on speedup are:
• Too much I/O
  • Speedup is limited when the code is I/O bound.
    • That is, when there is too much input or output compared to the amount of computation.
• Wrong algorithm
  • Speedup is limited when the numerical algorithm is not suitable for a parallel computer.
    • You need to replace it with a parallel algorithm.
• Too much memory contention
  • Speedup is limited when there is too much memory contention.
  • You need to redesign the code with attention to data locality.
    • Cache reutilization techniques will help here.
Speedup Limitations
• Wrong problem size
  • Speedup is limited when the problem size is too small to take best advantage of a parallel computer.
  • In addition, speedup is limited when the problem size is fixed.
    • That is, when the problem size doesn't grow as you compute with more processors.
• Too much sequential code
  • Speedup is limited when there's too much sequential code.
    • This is shown by Amdahl's Law.
• Too much parallel overhead
  • Speedup is limited when there is too much parallel overhead compared to the amount of computation.
    • These are the additional CPU cycles accumulated in creating parallel regions, creating threads, synchronizing threads, spin/blocking threads, and ending parallel regions.
• Load imbalance
  • Speedup is limited when the processors have different workloads.
    • The processors that finish early will be idle while they are waiting for the other processors to catch up.
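The effect of load imbalance can be sketched with a simple model. It assumes that each processor's run time equals its assigned workload and that there is no other overhead, so the parallel run takes as long as the most heavily loaded processor. The function names and workload numbers are hypothetical.

```python
def load_imbalance_speedup(workloads: list[float]) -> float:
    """Speedup under a simplified model: serial time is the total work,
    parallel time is the largest single workload."""
    return sum(workloads) / max(workloads)


def idle_time(workloads: list[float]) -> float:
    """Total time processors spend waiting for the slowest one to finish."""
    slowest = max(workloads)
    return sum(slowest - w for w in workloads)


# Hypothetical work per processor, in seconds, for the same 100 s of total work.
balanced = [25.0, 25.0, 25.0, 25.0]
imbalanced = [40.0, 20.0, 20.0, 20.0]

print(load_imbalance_speedup(balanced), idle_time(balanced))
print(load_imbalance_speedup(imbalanced), idle_time(imbalanced))
```

With the same total work on 4 processors, the imbalanced distribution drops the modeled speedup from 4 to 2.5, and three processors sit idle for 20 seconds each.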
Memory Contention Limitation
• Gene Golub, a professor of Computer Science at Stanford University, writes in his book on parallel computing that the best way to define memory contention is with the word delay.
  • When different processors all want to read or write into the main memory, there is a delay until the memory is free.
• On the SGI Origin2000 computer, you can determine whether your code has memory contention problems by using SGI's perfex utility.
  • The perfex utility is covered in the Cache Tuning lecture in this course.
  • You can also refer to SGI's manual page, man perfex, for more details.
• On the Linux clusters, you can use the hardware performance counter tools to get information on memory performance.
  • On the IA32 platform, use perfex, vprof, hmpcount, psrun/perfsuite.
  • On the IA64 platform, use vprof, pfmon, psrun/perfsuite.
Memory Contention Limitation
• Many of these tools can be used with the PAPI performance counter interface.
  • Be sure to refer to the man pages and webpages on the NCSA website for more information.
• If the output of the utility shows that memory contention is a problem, you will want to use some programming techniques for reducing memory contention.
  • A good way to reduce memory contention is to access elements from the processor's cache memory instead of the main memory.
  • Some programming techniques for doing this are:
    • Access arrays with unit stride.
    • Order nested do loops (in Fortran) so that the innermost loop index is the leftmost index of the arrays in the loop. For the C language, the order is the opposite of Fortran: the innermost loop index should be the rightmost array index.
    • Avoid specific array sizes that are the same as the size of the data cache or that are exact fractions or exact multiples of the size of the data cache.
    • Pad common blocks.
  • These techniques are called cache tuning optimizations. The details for performing these code modifications are covered in the section on Cache Optimization of this lecture.
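The loop-ordering rule follows from how the two languages lay out a two-dimensional array in memory. The sketch below is only a model of that layout, written in Python rather than Fortran or C: it computes the flat memory offset of element (i, j) for each layout and reports the stride, in elements, between consecutive accesses of the innermost loop. The function names are hypothetical, and 0-based indices are used for both layouts to keep the arithmetic simple.

```python
def offset_row_major(i: int, j: int, n_rows: int, n_cols: int) -> int:
    """Flat offset of a[i][j] in C (row-major) layout."""
    return i * n_cols + j


def offset_column_major(i: int, j: int, n_rows: int, n_cols: int) -> int:
    """Flat offset of a(i, j) in Fortran (column-major) layout, 0-based."""
    return i + j * n_rows


def inner_loop_stride(offset, inner_index: str, n_rows: int, n_cols: int) -> int:
    """Elements skipped per step when inner_index ('i' or 'j') is the inner loop."""
    base = offset(0, 0, n_rows, n_cols)
    if inner_index == "i":
        return offset(1, 0, n_rows, n_cols) - base
    if inner_index == "j":
        return offset(0, 1, n_rows, n_cols) - base
    raise ValueError("inner_index must be 'i' or 'j'")


n_rows, n_cols = 4, 6  # illustrative array shape

# Fortran: leftmost index innermost gives unit stride.
fortran_i_inner = inner_loop_stride(offset_column_major, "i", n_rows, n_cols)
fortran_j_inner = inner_loop_stride(offset_column_major, "j", n_rows, n_cols)

# C: rightmost index innermost gives unit stride.
c_j_inner = inner_loop_stride(offset_row_major, "j", n_rows, n_cols)
c_i_inner = inner_loop_stride(offset_row_major, "i", n_rows, n_cols)

print(fortran_i_inner, fortran_j_inner, c_j_inner, c_i_inner)
```

A stride of 1 means consecutive iterations touch neighboring memory locations, so each cache line that is loaded gets fully used before it is replaced.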
Problem Size Limitation
• Small Problem Size
  • Speedup is almost always an increasing function of problem size.
  • If there's not enough work to be done by the available processors, the code will show limited speedup.
  • The effect of small problem size on speedup is shown in the following illustration.
Problem Size Limitation
• Fixed Problem Size
  • When the problem size is fixed, you can reach a point of negative returns when using additional processors.
  • As you compute with more and more processors, each processor has less and less computation to perform.
  • The additional parallel overhead, compared to the amount of computation, causes the speedup curve to start turning downward as shown in the following figure.
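The downward turn of the speedup curve can be reproduced with a toy cost model. The model is an assumption made for illustration, not a measurement: the run time on p processors is taken to be the ideal time T_1 / p plus a parallel overhead that grows linearly with p. The function names and constants are hypothetical.

```python
def modeled_time(p: int, t_serial: float, overhead_per_proc: float) -> float:
    """Hypothetical model: ideal parallel time plus overhead proportional to p."""
    return t_serial / p + overhead_per_proc * p


def modeled_speedup(p: int, t_serial: float, overhead_per_proc: float) -> float:
    """Speedup T_1 / T_p, where T_p comes from the model above."""
    return t_serial / modeled_time(p, t_serial, overhead_per_proc)


t_serial = 100.0          # fixed problem size: 100 s of work on one processor
overhead_per_proc = 0.25  # assumed overhead added per processor, in seconds

speedups = {
    p: modeled_speedup(p, t_serial, overhead_per_proc)
    for p in (1, 2, 4, 8, 16, 32, 64, 128)
}
best_p = max(speedups, key=speedups.get)

for p, s in speedups.items():
    print(f"p = {p:3d}: speedup = {s:.2f}")
print("best processor count in this model:", best_p)
```

In this model the speedup rises up to 16 processors and falls after that, because beyond that point the added overhead outweighs the shrinking share of computation per processor.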
Benchmarks
• It will finally be time to report the parallel performance of your application code.
• You will want to show a speedup graph with the number of processors on the x axis, and speedup on the y axis.
• Some other things you should report and record are:
  • the date you obtained the results
  • the problem size
  • the computer model
  • the compiler and the version number of the compiler
  • any special compiler options you used
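One way to keep these items together is a small record type that stores the metadata next to the timings and produces the points for the speedup graph. This is only a sketch: the class name `BenchmarkRecord`, its field names, and every value in the example are hypothetical placeholders, not results from a real machine.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkRecord:
    """Metadata to report with a parallel benchmark, plus the raw timings."""
    date: str
    problem_size: str
    computer_model: str
    compiler: str
    compiler_version: str
    compiler_options: str
    timings: dict[int, float] = field(default_factory=dict)  # processors -> seconds

    def speedup_table(self) -> list[tuple[int, float]]:
        """(processors, speedup) pairs sorted by processor count, ready to plot
        with processors on the x axis and speedup on the y axis."""
        t1 = self.timings[1]
        return [(p, t1 / t) for p, t in sorted(self.timings.items())]


# Placeholder values for illustration only.
record = BenchmarkRecord(
    date="YYYY-MM-DD",
    problem_size="1000 x 1000 grid",
    computer_model="example-cluster",
    compiler="example-compiler",
    compiler_version="0.0",
    compiler_options="-O2",
    timings={1: 80.0, 2: 40.0, 4: 25.0},
)

table = record.speedup_table()
print(table)
```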
Benchmarks
• When doing computational science, it is often helpful to find out what kind of performance your colleagues are obtaining.
  • In this regard, NCSA has a compilation of parallel performance benchmarks online at http://www.ncsa.uiuc.edu/UserInfo/Perf/NCSAbench/.
• You might be interested in looking at these benchmarks to see how other people report their parallel performance.
  • In particular, the NAMD benchmark is a report about the performance of the NAMD program that does molecular dynamics simulations.
Summary
• There are many good texts on parallel computing which treat the subject of parallel performance analysis. Here are two useful references:
  • Scientific Computing: An Introduction with Parallel Computing, Gene Golub and James Ortega, Academic Press, Inc.
  • Parallel Computing: Theory and Practice, Michael J. Quinn, McGraw-Hill, Inc.