8/8/2019 01-Parallel Computing Explained
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
Agenda
1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
    1.1.1 Parallelism in our Daily Lives
    1.1.2 Parallelism in Computer Programs
    1.1.3 Parallelism in Computers
      1.1.3.4 Disk Parallelism
    1.1.4 Performance Measures
    1.1.5 More Parallelism Issues
  1.2 Comparison of Parallel Computers
  1.3 Summary
Parallel Computing Overview
Who should read this chapter?
- New Users: to learn concepts and terminology.
- Intermediate Users: for review or reference.
- Management Staff: to understand the basic concepts, even if you don't plan to do any programming.
- Note: Advanced users may opt to skip this chapter.
Introduction to Parallel Computing
High performance parallel computers:
- can solve large problems much faster than a desktop computer
- have fast CPUs, large memory, high speed interconnects, and high speed input/output
- are able to speed up computations
  - by making the sequential components run faster
  - by doing more operations in parallel
High performance parallel computers are in demand:
- there is a need for tremendous computational capabilities in science, engineering, and business
- applications require gigabytes/terabytes of memory and gigaflops/teraflops of performance
- scientists are striving for petascale performance
Introduction to Parallel Computing
HPPC are used in a wide variety of disciplines:
- Meteorologists: prediction of tornadoes and thunderstorms
- Computational biologists: analysis of DNA sequences
- Pharmaceutical companies: design of new drugs
- Oil companies: seismic exploration
- Wall Street: analysis of financial markets
- NASA: aerospace vehicle design
- Entertainment industry: special effects in movies and commercials
These complex scientific and business applications all need to perform computations on large datasets or large equations.
Parallelism in our Daily Lives
There are two types of processes that occur in computers and in our daily lives:
- Sequential processes
  - occur in a strict order
  - it is not possible to do the next step until the current one is completed
  - Examples:
    - The passage of time: the sun rises and the sun sets.
    - Writing a term paper: pick the topic, research, and write the paper.
- Parallel processes
  - many events happen simultaneously
  - Examples:
    - Plant growth in the springtime
    - An orchestra
Agenda
1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
    1.1.1 Parallelism in our Daily Lives
    1.1.2 Parallelism in Computer Programs
      1.1.2.1 Data Parallelism
      1.1.2.2 Task Parallelism
    1.1.3 Parallelism in Computers
      1.1.3.4 Disk Parallelism
    1.1.4 Performance Measures
    1.1.5 More Parallelism Issues
  1.2 Comparison of Parallel Computers
  1.3 Summary
Parallelism in Computer Programs
Conventional wisdom:
- Computer programs are sequential in nature.
- Only a small subset of them lend themselves to parallelism.
- Algorithm: the "sequence of steps" necessary to do a computation.
- For the first 30 years of computer use, programs were run sequentially.
The 1980s saw great successes with parallel computers:
- Dr. Geoffrey Fox published a book entitled Parallel Computing Works!
- Many scientific accomplishments resulted from parallel computing.
- Computer programs are parallel in nature.
- Only a small subset of them need to be run sequentially.
Parallel Computing
What a computer does when it carries out more than one computation at a time, using more than one processor.
By using many processors at once, we can speed up the execution:
- If one processor can perform the arithmetic in time t,
- then ideally p processors can perform the arithmetic in time t/p.
- What if I use 100 processors? What if I use 1000 processors?
Almost every program has some form of parallelism.
- You need to determine whether your data or your program can be partitioned into independent pieces that can be run simultaneously.
- Decomposition is the name given to this partitioning process.
Types of parallelism:
- data parallelism
- task parallelism
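The ideal t/p scaling above can be sketched as a small calculation. This is a hedged illustration only: real programs rarely achieve it, because sequential components and communication overhead limit the gain.

```python
def ideal_parallel_time(t, p):
    """Ideal runtime: arithmetic that takes time t on one processor
    takes t/p on p processors (perfect, linear speedup)."""
    return t / p

# If one processor needs 100 seconds:
for p in (1, 100, 1000):
    print(f"{p:5d} processors -> {ideal_parallel_time(100.0, p)} s")
```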
Data Parallelism
The same code segment runs concurrently on each processor, but each processor is assigned its own part of the data to work on.
- Do loops (in Fortran) define the parallelism.
- The iterations must be independent of each other.
- Data parallelism is called "fine grain parallelism" because the computational work is spread into many small subtasks.
Example:
- Dense linear algebra, such as matrix multiplication, is a perfect candidate for data parallelism.
An example of data parallelism

Original Sequential Code:

DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO

Parallel Code:

!$OMP PARALLEL DO
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO
Quick Intro to OpenMP
- OpenMP is a portable standard for parallel directives covering both data and task parallelism.
- More information about OpenMP is available on the OpenMP website.
- We will have a lecture on Introduction to OpenMP later.
- With OpenMP, the loop that is performed in parallel is the loop that immediately follows the PARALLEL DO directive.
- In our sample code, it's the K loop: DO K=1,N
OpenMP Loop Parallelism: Iteration-Processor Assignments

The code segment running on each processor:

DO J=1,N
  DO I=1,N
    C(I,J) = C(I,J) + A(I,K)*B(K,J)
  END DO
END DO

Processor  Iterations of K  Data Elements
proc0      K=1:5            A(I,1:5),  B(1:5,J)
proc1      K=6:10           A(I,6:10), B(6:10,J)
proc2      K=11:15          A(I,11:15), B(11:15,J)
proc3      K=16:20          A(I,16:20), B(16:20,J)
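The contiguous assignment shown in the table can be mimicked with a short sketch. Block partitioning is one common static schedule; the schedule an OpenMP runtime actually uses by default is implementation defined.

```python
def block_partition(n_iters, n_procs):
    """Assign loop iterations 1..n_iters to processors in contiguous
    blocks, as in the table above (assumes n_procs divides n_iters)."""
    chunk = n_iters // n_procs
    return {p: list(range(p * chunk + 1, (p + 1) * chunk + 1))
            for p in range(n_procs)}

parts = block_partition(20, 4)  # K=1..20 over proc0..proc3
```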
OpenMP Style of Parallelism
Parallelization can be done incrementally, as follows:
1. Parallelize the most computationally intensive loop.
2. Compute the performance of the code.
3. If performance is not satisfactory, parallelize another loop.
4. Repeat steps 2 and 3 as many times as needed.
The ability to perform incremental parallelism is considered a positive feature of data parallelism.
It is contrasted with the MPI (Message Passing Interface) style of parallelism, which is an "all or nothing" approach.
Task Parallelism
- Task parallelism may be thought of as the opposite of data parallelism.
- Instead of the same operations being performed on different parts of the data, each process performs different operations.
- You can use task parallelism when your program can be split into independent pieces, often subroutines, that can be assigned to different processors and run concurrently.
- Task parallelism is called "coarse grain" parallelism because the computational work is spread into just a few subtasks.
- More code is run in parallel because the parallelism is implemented at a higher level than in data parallelism.
- Task parallelism is often easier to implement and has less overhead than data parallelism.
Task Parallelism
The abstract code shown in the diagram is decomposed into 4 independent code segments, labeled A, B, C, and D. The right hand side of the diagram illustrates the 4 code segments running concurrently.
Task Parallelism

Original Code:

program main
  code segment labeled A
  code segment labeled B
  code segment labeled C
  code segment labeled D
end

Parallel Code:

program main
!$OMP PARALLEL
!$OMP SECTIONS
  code segment labeled A
!$OMP SECTION
  code segment labeled B
!$OMP SECTION
  code segment labeled C
!$OMP SECTION
  code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
end
OpenMP Task Parallelism
With OpenMP, the code that follows each SECTION(S) directive is allocated to a different processor. In our sample parallel code, the allocation of code segments to processors is as follows:

Processor  Code
proc0      code segment labeled A
proc1      code segment labeled B
proc2      code segment labeled C
proc3      code segment labeled D
Parallelism in Computers
How parallelism is exploited and enhanced within the operating system and hardware components of a parallel computer:
- operating system
- arithmetic
- memory
- disk
Operating System Parallelism
All of the commonly used parallel computers run a version of the Unix operating system. In the table below each OS listed is in fact Unix, but the name of the Unix OS varies with each vendor.
For more information about Unix, a collection of Unix documents is available.

Parallel Computer     OS
SGI Origin2000        IRIX
HP V-Class            HP-UX
Cray T3E              Unicos
IBM SP                AIX
Workstation Clusters  Linux
Two Unix Parallelism Features
- Background processing facility
  - With the Unix background processing facility you can run the executable a.out in the background and simultaneously view the man page for the etime function in the foreground. There are two Unix commands that accomplish this:

    a.out > results &
    man etime

- cron feature
  - With the Unix cron feature you can submit a job that will run at a later time.
Arithmetic Parallelism
- Multiple execution units
  - facilitate arithmetic parallelism.
  - The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are each done in a separate execution unit. This allows several execution units to be used simultaneously, because the execution units operate independently.
- Fused multiply and add
  - is another parallel arithmetic feature.
  - Parallel computers are able to overlap multiply and add. This arithmetic is named Multiply-ADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on HP computers. In either case, the two arithmetic operations are overlapped and can complete in hardware in one computer cycle.
- Superscalar arithmetic
  - is the ability to issue several arithmetic operations per computer cycle.
  - It makes use of the multiple, independent execution units. On superscalar computers there are multiple slots per cycle that can be filled with work. This gives rise to the name n-way superscalar, where n is the number of slots per cycle. The SGI Origin2000 is called a 4-way superscalar computer.
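The inner update of the earlier matrix-multiply loop, C(I,J) = C(I,J) + A(I,K)*B(K,J), is exactly one multiply-add. A minimal sketch of the operation that MADD/FMA hardware completes as a single instruction:

```python
def madd(a, b, c):
    """Multiply-add: returns a*b + c, the pattern a fused
    multiply-add unit executes as one hardware operation."""
    return a * b + c

# One step of the matrix-multiply inner loop: C = C + A*B
c = madd(2.0, 3.0, 4.0)
```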
Memory Parallelism
- Memory interleaving
  - Memory is divided into multiple banks, and consecutive data elements are interleaved among them. For example, if your computer has 2 memory banks, then data elements with even memory addresses fall into one bank, and data elements with odd memory addresses into the other.
- Multiple memory ports
  - Port means a bi-directional memory pathway. When the data elements that are interleaved across the memory banks are needed, the multiple memory ports allow them to be accessed and fetched in parallel, which increases the memory bandwidth (MB/s or GB/s).
- Multiple levels of the memory hierarchy
  - There is global memory that any processor can access. There is memory that is local to a partition of the processors. Finally, there is memory that is local to a single processor, that is, the cache memory and the memory elements held in registers.
- Cache memory
  - Cache is a small memory that has fast access compared with the larger main memory and serves to keep the faster processor filled with data.
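The even/odd bank mapping in the interleaving example can be sketched as a one-line address calculation:

```python
def bank_of(address, n_banks=2):
    """Bank holding a given (word) address when consecutive addresses
    are interleaved across n_banks memory banks."""
    return address % n_banks

# With 2 banks: even addresses fall in bank 0, odd addresses in bank 1.
banks = [bank_of(a) for a in range(6)]
```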
Memory Parallelism
[Figures: Memory Hierarchy; Cache Memory]
Disk Parallelism
- RAID (Redundant Array of Inexpensive Disks)
  - RAID disks are on most parallel computers.
  - The advantage of a RAID disk system is that it provides a measure of fault tolerance.
  - If one of the disks goes down, it can be swapped out, and the RAID disk system remains operational.
- Disk striping
  - When a data set is written to disk, it is striped across the RAID disk system. That is, it is broken into pieces that are written simultaneously to the different disks in the RAID disk system. When the same data set is read back in, the pieces are read in parallel, and the full data set is reassembled in memory.
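Striping can be sketched as breaking a data set into round-robin pieces and reassembling them. This is a toy illustration only; real RAID systems stripe fixed-size blocks and, in most levels, add parity.

```python
def stripe(data, n_disks):
    """Split a data set round-robin into one piece per disk, so all
    pieces can be written (or read) simultaneously."""
    return [data[d::n_disks] for d in range(n_disks)]

def reassemble(pieces):
    """Interleave the per-disk pieces back into the original order."""
    n_disks = len(pieces)
    out = [None] * sum(len(p) for p in pieces)
    for d, piece in enumerate(pieces):
        for j, item in enumerate(piece):
            out[j * n_disks + d] = item
    return out
```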
Agenda
1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
    1.1.1 Parallelism in our Daily Lives
    1.1.2 Parallelism in Computer Programs
    1.1.3 Parallelism in Computers
      1.1.3.4 Disk Parallelism
    1.1.4 Performance Measures
    1.1.5 More Parallelism Issues
  1.2 Comparison of Parallel Computers
  1.3 Summary
Performance Measures
- Peak Performance
  - is the top speed at which the computer can operate.
  - It is a theoretical upper limit on the computer's performance.
- Sustained Performance
  - is the highest consistently achieved speed.
  - It is a more realistic measure of computer performance.
- Cost Performance
  - is used to determine if the computer is cost effective.
- MHz
  - is a measure of the processor speed.
  - The processor speed is commonly measured in millions of cycles per second, where a computer cycle is defined as the shortest time in which some work can be done.
- MIPS
  - is a measure of how quickly the computer can issue instructions.
  - Millions of instructions per second is abbreviated as MIPS, where the instructions are computer instructions such as: memory reads and writes, logical operations, floating point operations, integer operations, and branch instructions.
Performance Measures
- Mflops (millions of floating point operations per second)
  - measures how quickly a computer can perform floating-point operations such as add, subtract, multiply, and divide.
- Speedup
  - measures the benefit of parallelism.
  - It shows how your program scales as you compute with more processors, compared to the performance on one processor.
  - Ideal speedup happens when the performance gain is linearly proportional to the number of processors used.
- Benchmarks
  - are used to rate the performance of parallel computers and parallel programs.
  - A well known benchmark that is used to compare parallel computers is the Linpack benchmark.
  - Based on the Linpack results, a list is produced of the Top 500 Supercomputer Sites. This list is maintained by the University of Tennessee and the University of Mannheim.
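Speedup, and the closely related efficiency measure, can be computed from measured runtimes; a minimal sketch (the 12.5 s timing is a made-up example):

```python
def speedup(t1, tp):
    """Speedup = runtime on one processor / runtime on p processors.
    Ideal speedup on p processors equals p."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Fraction of the ideal speedup actually achieved (1.0 is perfect)."""
    return speedup(t1, tp) / p

# 100 s on 1 processor, 12.5 s on 10 processors:
s = speedup(100.0, 12.5)
e = efficiency(100.0, 12.5, 10)
```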
More Parallelism Issues
- Load balancing
  - is the technique of evenly dividing the workload among the processors.
  - For data parallelism it involves how iterations of loops are allocated to processors.
  - Load balancing is important because the total time for the program to complete is the time spent by the longest executing thread.
- The problem size
  - must be large and must be able to grow as you compute with more processors.
  - In order to get the performance you expect from a parallel computer you need to run a large application with large data sizes; otherwise the overhead of passing information between processors will dominate the calculation time.
- Good software tools
  - are essential for users of high performance parallel computers.
  - These tools include:
    - parallel compilers
    - parallel debuggers
    - performance analysis tools
    - parallel math software
  - The availability of a broad set of application software is also important.
More Parallelism Issues
- The high performance computing market is risky and chaotic. Many supercomputer vendors are no longer in business, making the portability of your application very important.
- A workstation farm
  - is defined as a fast network connecting heterogeneous workstations.
  - The individual workstations serve as desktop systems for their owners.
  - When they are idle, large problems can take advantage of the unused cycles in the whole system.
  - An application of this concept is the SETI project. You can participate in searching for extraterrestrial intelligence with your home PC. More information about this project is available at the SETI Institute.
- Condor
  - is software that provides resource management services for applications that run on heterogeneous collections of workstations.
  - Miron Livny at the University of Wisconsin at Madison is the director of the Condor project, and has coined the phrase "high throughput computing" to describe this process of harnessing idle workstation cycles. More information is available at the Condor Home Page.
Agenda
1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
  1.2 Comparison of Parallel Computers
    1.2.1 Processors
    1.2.2 Memory Organization
    1.2.3 Flow of Control
    1.2.4 Interconnection Networks
      1.2.4.1 Bus Network
      1.2.4.2 Cross-Bar Switch Network
      1.2.4.3 Hypercube Network
      1.2.4.4 Tree Network
      1.2.4.5 Interconnection Networks Self-test
    1.2.5 Summary of Parallel Computer Characteristics
  1.3 Summary
Comparison of Parallel Computers
Now you can explore the hardware components of parallel computers:
- kinds of processors
- types of memory organization
- flow of control
- interconnection networks
You will see what is common to these parallel computers, and what makes each one of them unique.
Kinds of Processors
There are three types of parallel computers:
1. Computers with a small number of powerful processors
   - Typically have tens of processors.
   - The cooling of these computers often requires very sophisticated and expensive equipment, making these computers very expensive for computing centers.
   - They are general-purpose computers that perform especially well on applications that have large vector lengths.
   - Examples of this type of computer are the Cray SV1 and the Fujitsu VPP5000.
Kinds of Processors
There are three types of parallel computers:
2. Computers with a large number of less powerful processors
   - Named a Massively Parallel Processor (MPP), these typically have thousands of processors.
   - The processors are usually proprietary and air-cooled.
   - Because of the large number of processors, the distance between the furthest processors can be quite large, requiring a sophisticated internal network that allows distant processors to communicate with each other quickly.
   - These computers are suitable for applications with a high degree of concurrency.
   - The MPP type of computer was popular in the 1980s.
   - Examples of this type of computer were the Thinking Machines CM-2 computer, and the computers made by the MassPar company.
Kinds of Processors
There are three types of parallel computers:
3. Computers that are medium scale, in between the two extremes
   - Typically have hundreds of processors.
   - The processor chips are usually not proprietary; rather, they are commodity processors like the Pentium III.
   - These are general-purpose computers that perform well on a wide range of applications.
   - The most common example of this class is the Linux Cluster.
Trends and Examples
Processor trends:

Decade  Processor Type                   Computer Example
1970s   Pipelined, Proprietary           Cray-1
1980s   Massively Parallel, Proprietary  Thinking Machines CM-2
1990s   Superscalar, RISC, Commodity     SGI Origin2000
2000s   CISC, Commodity                  Workstation Clusters

The processors on today's commonly used parallel computers:

Computer              Processor
SGI Origin2000        MIPS RISC R12000
HP V-Class            HP PA 8200
Cray T3E              Compaq Alpha
IBM SP                IBM Power3
Workstation Clusters  Intel Pentium III, Intel Itanium
Memory Organization
The following paragraphs describe the three types of memory organization found on parallel computers:
- distributed memory
- shared memory
- distributed shared memory
Distributed Memory
- In distributed memory computers, the total memory is partitioned into memory that is private to each processor.
- There is a Non-Uniform Memory Access time (NUMA), which is proportional to the distance between the two communicating processors.
- On NUMA computers, data is accessed the quickest from a private memory, while data from the most distant processor takes the longest to access.
- Some examples are the Cray T3E, the IBM SP, and workstation clusters.
Distributed Memory
- When programming distributed memory computers, the code and the data should be structured such that the bulk of a processor's data accesses are to its own private (local) memory.
- This is called having good data locality.
- Today's distributed memory computers use message passing, such as MPI, to communicate between processors as shown in the following example:
Distributed Memory
- One advantage of distributed memory computers is that they are easy to scale. As the demand for resources grows, computer centers can easily add more memory and processors.
- This is often called the LEGO block approach.
- The drawback is that programming of distributed memory computers can be quite complicated.
Shared Memory
- In shared memory computers, all processors have access to a single pool of centralized memory with a uniform address space.
- Any processor can address any memory location at the same speed, so there is Uniform Memory Access time (UMA).
- Processors communicate with each other through the shared memory.
- The advantages and disadvantages of shared memory machines are roughly the opposite of distributed memory computers.
- They are easier to program because they resemble the programming of single processor machines.
- But they don't scale like their distributed memory counterparts.
Distributed Shared Memory
- In Distributed Shared Memory (DSM) computers, a cluster or partition of processors has access to a common shared memory.
- It accesses the memory of a different processor cluster in a NUMA fashion.
- Memory is physically distributed but logically shared.
- Attention to data locality is again important.
- Distributed shared memory computers combine the best features of both distributed memory computers and shared memory computers.
- That is, DSM computers have both the scalability of distributed memory computers and the ease of programming of shared memory computers.
- Some examples of DSM computers are the SGI Origin2000 and the HP V-Class computers.
Trends and Examples
Memory organization trends:

Decade  Memory Organization        Example
1970s   Shared Memory              Cray-1
1980s   Distributed Memory         Thinking Machines CM-2
1990s   Distributed Shared Memory  SGI Origin2000
2000s   Distributed Memory         Workstation Clusters

The memory organization of today's commonly used parallel computers:

Computer              Memory Organization
SGI Origin2000        DSM
HP V-Class            DSM
Cray T3E              Distributed
IBM SP                Distributed
Workstation Clusters  Distributed
Flow of Control
When you look at the flow of control you will see three types of parallel computers:
- Single Instruction Multiple Data (SIMD)
- Multiple Instruction Multiple Data (MIMD)
- Single Program Multiple Data (SPMD)
Flynn's Taxonomy
- Flynn's Taxonomy, devised in 1972 by Michael Flynn of Stanford University, describes computers by how streams of instructions interact with streams of data.
- There can be single or multiple instruction streams, and there can be single or multiple data streams. This gives rise to 4 types of computers, as shown in the diagram below.
- Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD, and MIMD.
- Of these 4, only SIMD and MIMD are applicable to parallel computers.
- Another computer type, SPMD, is a special case of MIMD.
SIMD Computers
- SIMD stands for Single Instruction Multiple Data.
- Each processor follows the same set of instructions, with different data elements being allocated to each processor.
- SIMD computers have distributed memory with typically thousands of simple processors, and the processors run in lock step.
- SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications, such as neural networks.
- Some examples of SIMD computers were the Thinking Machines CM-2 computer and the computers from the MassPar company.
- The processors are commanded by the global controller that sends instructions to the processors.
  - It says add, and they all add.
  - It says shift to the right, and they all shift to the right.
- The processors are like obedient soldiers, marching in unison.
MIMD Computers
- MIMD stands for Multiple Instruction Multiple Data.
- There are multiple instruction streams, with separate code segments distributed among the processors.
- MIMD is actually a superset of SIMD, so the processors can run the same instruction stream or different instruction streams.
- In addition, there are multiple data streams; different data elements are allocated to each processor.
- MIMD computers can have either distributed memory or shared memory.
- While the processors on SIMD computers run in lock step, the processors on MIMD computers run independently of each other.
- MIMD computers can be used for either data parallel or task parallel applications.
- Some examples of MIMD computers are the SGI Origin2000 computer and the HP V-Class computer.
SPMD Computers
- SPMD stands for Single Program Multiple Data.
- SPMD is a special case of MIMD.
- SPMD execution happens when a MIMD computer is programmed to have the same set of instructions per processor.
- With SPMD computers, while the processors are running the same code segment, each processor can run that code segment asynchronously.
- Unlike SIMD, the synchronous execution of instructions is relaxed.
- An example is the execution of an if statement on a SPMD computer.
  - Because each processor computes with its own partition of the data elements, it may evaluate the right hand side of the if statement differently from another processor.
  - One processor may take a certain branch of the if statement, and another processor may take a different branch of the same if statement.
  - Hence, even though each processor has the same set of instructions, those instructions may be evaluated in a different order from one processor to the next.
- The analogies we used for describing SIMD computers can be modified for MIMD computers.
  - Instead of the SIMD obedient soldiers, all marching in unison, in the MIMD world the processors march to the beat of their own drummer.
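The SPMD if-statement behaviour can be sketched as the same function run by every "processor", each with its own data partition. The ranks and data values here are made up for illustration.

```python
def spmd_code(rank, local_data):
    """Every rank runs this same code, but each may take a different
    branch of the if statement depending on its own data partition."""
    if sum(local_data) > 10:
        return f"rank {rank} took the 'large' branch"
    return f"rank {rank} took the 'small' branch"

partitions = [[1, 2, 3], [7, 8, 9]]  # hypothetical data split over 2 ranks
results = [spmd_code(r, d) for r, d in enumerate(partitions)]
```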
Summary of SIMD versus MIMD

                SIMD                     MIMD
Memory          distributed memory       distributed memory or shared memory
Code Segment    same per processor       same or different
Processors Run  in lock step             asynchronously
Data Elements   different per processor  different per processor
Applications    data parallel            data parallel or task parallel
Trends and Examples
Flow of control trends:

Decade  Flow of Control  Computer Example
1980s   SIMD             Thinking Machines CM-2
1990s   MIMD             SGI Origin2000
2000s   MIMD             Workstation Clusters

The flow of control on today's commonly used parallel computers:

Computer              Flow of Control
SGI Origin2000        MIMD
HP V-Class            MIMD
Cray T3E              MIMD
IBM SP                MIMD
Workstation Clusters  MIMD
Agenda
1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
  1.2 Comparison of Parallel Computers
    1.2.1 Processors
    1.2.2 Memory Organization
    1.2.3 Flow of Control
    1.2.4 Interconnection Networks
      1.2.4.1 Bus Network
      1.2.4.2 Cross-Bar Switch Network
      1.2.4.3 Hypercube Network
      1.2.4.4 Tree Network
      1.2.4.5 Interconnection Networks Self-test
    1.2.5 Summary of Parallel Computer Characteristics
  1.3 Summary
Interconnection Networks
What exactly is the interconnection network?
- The interconnection network is made up of the wires and cables that define how the multiple processors of a parallel computer are connected to each other and to the memory units.
- The time required to transfer data is dependent upon the specific type of the interconnection network.
- This transfer time is called the communication time.
What network characteristics are important?
- Diameter: the maximum distance that data must travel for 2 processors to communicate.
- Bandwidth: the amount of data that can be sent through a network connection.
- Latency: the delay on a network while a data packet is being stored and forwarded.
Types of Interconnection Networks
The network topologies (geometric arrangements of the computer network connections) are:
- Bus
- Cross-bar Switch
- Hypercube
- Tree
Interconnection Networks
The aspects of network issues are:
- Cost
- Scalability
- Reliability
- Suitable Applications
- Data Rate
- Diameter
- Degree
General Network Characteristics
- Some networks can be compared in terms of their degree and diameter.
- Degree: how many communicating wires are coming out of each processor.
  - A large degree is a benefit because it provides multiple paths.
- Diameter: the distance between the two processors that are farthest apart.
  - A small diameter corresponds to low latency.
Bus Network
- Bus topology is the original coaxial cable-based Local Area Network (LAN) topology, in which the medium forms a single bus to which all stations are attached.
- The positive aspects:
  - It is a mature technology that is well known and reliable.
  - The cost is very low.
  - It is simple to construct.
- The negative aspects:
  - limited data transmission rate
  - not scalable in terms of performance
- Example: the SGI Power Challenge
  - only scaled to 18 processors
Cross-Bar Switch Network
- A cross-bar switch is a network that works through a switching mechanism to access shared memory.
- It scales better than the bus network, but it costs significantly more.
- The telephone system uses this type of network. An example of a computer with this type of network is the HP V-Class.
- The diagram shows the processors talking through the switchboxes to store or retrieve data in memory.
- There are multiple paths for a processor to communicate with a certain memory.
- The switches determine the optimal route to take.
Hypercube Network
- In a hypercube network, the processors are connected as if they were corners of a multidimensional cube. Each node in an N-dimensional cube is directly connected to N other nodes.
- The fact that the number of directly connected, "nearest neighbor", nodes increases with the total size of the network is highly desirable for a parallel computer.
- The degree of a hypercube network is log n and the diameter is log n, where n is the number of processors.
- Examples of computers with this type of network are the CM-2, NCUBE-2, and the Intel iPSC860.
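The log n degree and diameter can be checked with a small sketch (n must be a power of two for a hypercube):

```python
import math

def hypercube_degree(n):
    """Each node of an n-processor hypercube connects directly to
    log2(n) neighbours; the diameter is also log2(n)."""
    return int(math.log2(n))

# An 8-processor (3-dimensional) hypercube: 3 neighbours per node, diameter 3.
deg = hypercube_degree(8)
```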
Tree Network
- The processors are the bottom nodes of the tree. For a processor to retrieve data, it must go up in the network and then back down.
- This is useful for decision-making applications that can be mapped as trees.
- The degree of a tree network is 1. The diameter of the network is 2 log(n+1) - 2, where n is the number of processors.
- The Thinking Machines CM-5 is an example of a parallel computer with this type of network.
- Tree networks are very suitable for database applications because they allow multiple searches through the database at a time.
Interconnected Networks
- Torus Network: a mesh with wrap-around connections in both the x and y directions.
- Multistage Network: a network with more than one networking unit.
- Fully Connected Network: a network where every processor is connected to every other processor.
- Hypercube Network: processors are connected as if they were corners of a multidimensional cube.
- Mesh Network: a network where each interior processor is connected to its four nearest neighbors.
Interconnected Networks
- Bus Based Network: coaxial cable-based LAN topology in which the medium forms a single bus to which all stations are attached.
- Cross-bar Switch Network: a network that works through a switching mechanism to access shared memory.
- Tree Network: the processors are the bottom nodes of the tree.
- Ring Network: each processor is connected to two others, and the line of connections forms a circle.
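The degree and diameter figures quoted for these topologies can be sanity-checked with a short sketch. The closed forms below (ring diameter n/2, hypercube diameter log2 n, fully connected diameter 1) are standard results for the idealized networks, stated here as assumptions rather than taken from the slides:

```python
import math

# Diameter = longest shortest path between any two of the n processors.
def ring_diameter(n):
    return n // 2               # worst case: travel halfway around the circle

def hypercube_diameter(n):
    return int(math.log2(n))    # fix one differing dimension per hop; n a power of 2

def fully_connected_diameter(n):
    return 1                    # every processor is one hop from every other

print(ring_diameter(16))             # 8
print(hypercube_diameter(16))        # 4
print(fully_connected_diameter(16))  # 1
```

A small diameter (fully connected, hypercube) means low worst-case latency, at the cost of a higher degree and more wiring.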
Summary of Parallel Computer Characteristics
- How many processors does the computer have?
  - 10s? 100s? 1000s?
- How powerful are the processors?
  - What's the MHz rate?
  - What's the MIPS rate?
- What's the instruction set architecture?
  - RISC
  - CISC
Summary of Parallel Computer Characteristics
- How much memory is available?
  - Total memory
  - Memory per processor
- What kind of memory?
  - Distributed memory
  - Shared memory
  - Distributed shared memory
- What type of flow of control?
  - SIMD
  - MIMD
  - SPMD
Summary of Parallel Computer Characteristics
- What is the interconnection network?
  - Bus
  - Crossbar
  - Hypercube
  - Tree
  - Torus
  - Multistage
  - Fully Connected
  - Mesh
  - Ring
  - Hybrid
Design decisions made by some of the major parallel computer vendors

Computer              Programming Style  OS      Processors         Memory       Flow of Control  Network
SGI Origin2000        OpenMP, MPI        IRIX    MIPS RISC R10000   DSM          MIMD             Crossbar, Hypercube
HP V-Class            OpenMP, MPI        HP-UX   HP PA 8200         DSM          MIMD             Crossbar, Ring
Cray T3E              SHMEM              Unicos  Compaq Alpha       Distributed  MIMD             Torus
IBM SP                MPI                AIX     IBM Power3         Distributed  MIMD             IBM Switch
Workstation Clusters  MPI                Linux   Intel Pentium III  Distributed  MIMD             Myrinet Tree
Summary
- This completes our introduction to parallel computing.
- You have learned about parallelism in computer programs, and also about parallelism in the hardware components of parallel computers.
- In addition, you have learned about the commonly used parallel computers, and how these computers compare to each other.
- There are many good texts which provide an introductory treatment of parallel computing. Here are two useful references:
  - Highly Parallel Computing, Second Edition. George S. Almasi and Allan Gottlieb. Benjamin/Cummings Publishers, 1994.
  - Parallel Computing: Theory and Practice. Michael J. Quinn. McGraw-Hill, Inc., 1994.
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
  2.1 Automatic Compiler Parallelism
  2.2 Data Parallelism by Hand
  2.3 Mixing Automatic and Hand Parallelism
  2.4 Task Parallelism
  2.5 Parallelism Issues
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
How to Parallelize a Code
- This chapter describes how to turn a single-processor program into a parallel one, focusing on shared memory machines.
- Both automatic compiler parallelization and parallelization by hand are covered.
- The details for accomplishing both data parallelism and task parallelism are presented.
Automatic Compiler Parallelism
- Automatic compiler parallelism enables you to use a single compiler option and let the compiler do the work.
- The advantage is that it's easy to use.
- The disadvantages are:
  - The compiler only does loop-level parallelism, not task parallelism.
  - The compiler wants to parallelize every do loop in your code. If you have hundreds of do loops, this creates way too much parallel overhead.
Automatic Compiler Parallelism
- To use automatic compiler parallelism on a Linux system with the Intel compilers, specify the following:

    ifort -parallel -O2 ... prog.f

- The compiler creates conditional code that will run with any number of threads.
- Specify the number of threads with setenv, and make sure you still get the right answers:

    setenv OMP_NUM_THREADS 4
    a.out > results
Data Parallelism by Hand
- First identify the loops that use most of the CPU time (the Profiling lecture describes how to do this).
- By hand, insert into the code OpenMP directive(s) just before the loop(s) you want to make parallel.
- Some code modifications may be needed to remove data dependencies and other inhibitors of parallelism.
- Use your knowledge of the code and data to assist the compiler.
- For the SGI Origin2000 computer, insert into the code an OpenMP directive just before the loop that you want to make parallel:

    !$OMP PARALLEL DO
    do i = 1, n
      ... lots of computation ...
    end do
    !$OMP END PARALLEL DO
Data Parallelism by Hand
- Compile with the -mp compiler option:

    f90 -mp ... prog.f

- As before, the compiler generates conditional code that will run with any number of threads.
- If you want to rerun your program with a different number of threads, you do not need to recompile; just re-specify the setenv command:

    setenv OMP_NUM_THREADS 8
    a.out > results2

- The setenv command can be placed anywhere before the a.out command.
- The setenv command must be typed exactly as indicated. If you have a typo, you will not receive a warning or error message. To make sure that the setenv command is specified correctly, type:

    setenv

- It produces a listing of your environment variable settings.
Mixing Automatic and Hand Parallelism
- You can have one source file parallelized automatically by the compiler, and another source file parallelized by hand. Suppose you split your code into two files named prog1.f and prog2.f:

    f90 -c -apo prog1.f    (automatic parallelization for prog1.f)
    f90 -c -mp  prog2.f    (hand parallelization for prog2.f)
    f90 prog1.o prog2.o    (creates one executable)
    a.out > results        (runs the executable)
Task Parallelism
- You can accomplish task parallelism as follows:

    !$OMP PARALLEL
    !$OMP SECTIONS
      ... lots of computation in part A ...
    !$OMP SECTION
      ... lots of computation in part B ...
    !$OMP SECTION
      ... lots of computation in part C ...
    !$OMP END SECTIONS
    !$OMP END PARALLEL

- Compile with the -mp compiler option:

    f90 -mp prog.f

- Use the setenv command to specify the number of threads:

    setenv OMP_NUM_THREADS 3
    a.out > results
Parallelism Issues
- There are some issues to consider when parallelizing a program:
  - Should data parallelism or task parallelism be used?
  - Should automatic compiler parallelism or parallelism by hand be used?
  - Which loop in a nested loop situation should be the one that becomes parallel?
  - How many threads should be used?
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
  3.1 Recompile
  3.2 Word Length
  3.3 Compiler Options for Debugging
  3.4 Standards Violations
  3.5 IEEE Arithmetic Differences
  3.6 Math Library Differences
  3.7 Compute Order Related Differences
  3.8 Optimization Level Too High
  3.9 Diagnostic Listings
  3.10 Further Information
Recompile
- Some codes just need to be recompiled to get accurate results.
- The compilers available on the NCSA computer platforms are shown in the following table (the IA-32 and IA-64 columns list the Portland Group / Intel / GNU compilers):

Language    SGI Origin2000 (MIPSpro)  IA-32 Linux          IA-64 Linux
Fortran 77  f77                       pgf77 / ifort / g77  ifort / g77
Fortran 90  f90                       pgf90 / ifort        ifort
Fortran 95  f95                       ifort                ifort
HPF         -                         pghpf                pghpf
C           cc                        pgcc / icc / gcc     icc / gcc
C++         CC                        pgCC / icpc / g++    icpc / g++
Word Length
- Code flaws can occur when you are porting your code to a different word length computer.
- For C, the size of an integer variable differs depending on the machine and how the variable is generated. On the IA-32 and IA-64 Linux clusters, the size of an integer variable is 4 and 8 bytes, respectively. On the SGI Origin2000, the corresponding value is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag.
- For Fortran, the SGI MIPSpro and Intel compilers contain the following flags to set default variable size:
  - -in, where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
  - -rn, where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.
Compiler Options for Debugging
- On the SGI Origin2000, the MIPSpro compilers include debugging options via the -DEBUG: group. The syntax is as follows:

    -DEBUG:option1[=value1]:option2[=value2]...

- Two examples are:
  - Array-bound checking: check for subscripts out of range at runtime.

        -DEBUG:subscript_check=ON

  - Force all uninitialized stack, automatic, and dynamically allocated variables to be initialized.

        -DEBUG:trap_uninitialized=ON
Compiler Options for Debugging
- On the IA-32 Linux cluster, the Fortran compiler is equipped with the following -C flags for runtime diagnostics:
  - -CA: pointers and allocatable references
  - -CB: array and subscript bounds
  - -CS: consistent shape of intrinsic procedure
  - -CU: use of uninitialized variables
  - -CV: correspondence between dummy and actual arguments
Standards Violations
- Code flaws can occur when the program has non-ANSI standard Fortran coding.
- ANSI standard Fortran is a set of rules for compiler writers that specify, for example, the value of the do loop index upon exit from the do loop.
- Standards Violations Detection:
  - To detect standards violations on the SGI Origin2000 computer, use the -ansi flag.
  - This option generates a listing of warning messages for the use of non-ANSI standard coding.
  - On the Linux clusters, the -ansi[-] flag enables/disables assumption of ANSI conformance.
IEEE Arithmetic Differences
- Code flaws occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not.
- The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior.
- For example, it prohibits the compiler writer from replacing x/y with x*recip(y), since the two results may differ slightly for some operands. You can make your program strictly conform to the IEEE standard.
- To make your program conform to the IEEE Arithmetic Standard on the SGI Origin2000 computer, use:

    f90 -OPT:IEEE_arithmetic=n ... prog.f

  where n is 1, 2, or 3.
- This option specifies the level of conformance to the IEEE standard, where 1 is the most stringent and 3 is the most liberal.
- On the Linux clusters, the Intel compilers can achieve conformance to the IEEE standard at a stringent level with the -mp flag, or a slightly relaxed level with the -mp1 flag.
Math Library Differences
- Most high-performance parallel computers are equipped with vendor-supplied math libraries.
- On the SGI Origin2000 platform, there are the SGI/Cray Scientific Library (SCSL) and Complib.sgimath.
  - SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK, and Fast Fourier Transform (FFT) routines.
  - SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version.
  - The complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version.
- The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.
Math Library Differences
- On the IA-32 Linux cluster, the libraries to link to are:
  - For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
  - For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
  - When calling MKL routines from C/C++ programs, you also need to link with -lF90.
- On the IA-64 Linux cluster, the corresponding libraries are:
  - For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
  - For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
  - When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins.
Compute Order Related Differences
- Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The compute order in which the threads will run cannot be guaranteed.
- For example, in a data parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program.
- Note: if your algorithm depends on data being compared in a specific order, your code is inappropriate for a parallel computer.
- Use the following method to detect compute order related differences. If your loop looks like

    DO I = 1, N

  change it to

    DO I = N, 1, -1

  The results should not change if the iterations are independent.
Optimization Level Too High
- Code flaws can occur when the optimization level has been set too high, thus trading speed for accuracy.
- The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at higher optimization levels.
- Setting the Optimization Level:
  - Both the SGI Origin2000 computer and the IBM Linux clusters provide Level 0 (no optimization) to Level 3 (most aggressive) optimization, using the -O{0,1,2, or 3} flag. One should bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking correctness and precision of calculation is highly recommended when -O3 is used.
  - For example, on the Origin2000, f90 -O0 prog.f turns off all optimizations.
Diagnostic Listings
- The SGI Origin2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:

    f90 -listing ...
    f90 -fullwarn ...
    f90 -showdefaults ...
    f90 -version ...
    f90 -help ...
Further Information
- SGI:
  - man f77/f90/cc
  - man debug_group
  - man math
  - man complib.sgimath
  - MIPSpro 64-Bit Porting and Transition Guide
  - Online Manuals
- Linux clusters pages:
  - ifort/icc/icpc -help (IA-32, IA-64, Intel64)
  - Intel Fortran Compiler for Linux
  - Intel C/C++ Compiler for Linux
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
  4.1 Aggressive Compiler Options
  4.2 Compiler Optimizations
  4.3 Vendor Tuned Code
  4.4 Further Information
Scalar Tuning
- If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime.
- This chapter describes many of these techniques:
  - The use of the most aggressive compiler options
  - The improvement of loop unrolling
  - The use of subroutine inlining
  - The use of vendor-supplied tuned code
- The detection of cache problems, and their solution, are presented in the Cache Tuning chapter.
Aggressive Compiler Options
- It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes.
- It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization.
- On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n=1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to the IEEE standard at different levels.
- On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
  4.1 Aggressive Compiler Options
  4.2 Compiler Optimizations
    4.2.1 Statement Level
    4.2.2 Block Level
    4.2.3 Routine Level
    4.2.4 Software Pipelining
    4.2.5 Loop Unrolling
    4.2.6 Subroutine Inlining
    4.2.7 Optimization Report
    4.2.8 Profile-guided Optimization (PGO)
  4.3 Vendor Tuned Code
  4.4 Further Information
Compiler Optimizations
- The various compiler optimizations can be classified as follows:
  - Statement Level Optimizations
  - Block Level Optimizations
  - Routine Level Optimizations
  - Software Pipelining
  - Loop Unrolling
  - Subroutine Inlining
- Each of these is described in the following sections.
Statement Level
- Constant Folding: replace simple arithmetic operations on constants with the pre-computed result.
  - y = 5 + 7 becomes y = 12
- Short Circuiting: avoid executing parts of conditional tests that are not necessary.
  - In if (I.eq.J .or. I.eq.K), when I = J, immediately compute the expression.
- Register Assignment: put frequently used variables in registers.
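The same statement-level ideas can be observed in any language. The sketch below is illustrative only (Python rather than the Fortran of the slides, and not tied to any particular compiler); it demonstrates constant folding and short-circuit evaluation:

```python
# Constant folding: Python's own compiler pre-computes constant expressions,
# so "5 + 7" is stored as the literal 12 in the compiled bytecode.
y = 5 + 7
print(y)  # 12

# Short circuiting: in "a or b", b is never evaluated when a is truthy.
calls = []
def check(name, value):
    calls.append(name)   # record that this operand was actually evaluated
    return value

result = check("first", True) or check("second", True)
print(calls)  # ['first']  -- the second operand was skipped entirely
```

In the Fortran test if (I.eq.J .or. I.eq.K), short circuiting means the second comparison is skipped once the first is true, exactly as the second `check` call is skipped here.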
Block Level
- Dead Code Elimination: remove unreachable code and code that is never executed or used.
- Instruction Scheduling: reorder the instructions to improve memory pipelining.
Routine Level
- Strength Reduction: replace expressions in a loop with an expression that takes fewer cycles.
- Common Subexpression Elimination: expressions that appear more than once are computed once, and the result is substituted for each occurrence of the expression.
- Constant Propagation: compile-time replacement of variables with constants.
- Loop Invariant Elimination: expressions inside a loop that don't change with the do loop index are moved outside the loop.
Software Pipelining
- Software pipelining allows the mixing of operations from different loop iterations in each iteration of the hardware loop. It is used to get the maximum work done per clock cycle.
- Note: on the R10000s there is out-of-order execution of instructions, and software pipelining may actually get in the way of this feature.
Loop Unrolling
- The loop's stride (or step) value is increased, and the body of the loop is replicated. It is used to improve the scheduling of the loop by giving a longer sequence of straight-line code. An example of loop unrolling follows:

    Original Loop             Unrolled Loop
    do I = 1, 99              do I = 1, 99, 3
      c(I) = a(I) + b(I)        c(I)   = a(I)   + b(I)
    enddo                       c(I+1) = a(I+1) + b(I+1)
                                c(I+2) = a(I+2) + b(I+2)
                              enddo

- There is a limit to the amount of unrolling that can take place because there are a limited number of registers.
- On the SGI Origin2000, loops are unrolled to a level of 8 by default. You can unroll to a level of 12 by specifying:

    f90 -O3 -OPT:unroll_times_max=12 ... prog.f

- On the IA-32 Linux cluster, the corresponding flags are -unroll and -unroll0 for unrolling and no unrolling, respectively.
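As a quick sanity check that the transformation above preserves results, here is a Python re-expression of the Fortran example (purely illustrative; 1-based indices and an unroll factor of 3, matching the slide):

```python
# Both loops fill c identically; index 0 is unused to mimic Fortran's 1-based arrays.
n = 99
a = [float(i) for i in range(n + 1)]
b = [2.0 * i for i in range(n + 1)]

c_orig = [0.0] * (n + 1)
for i in range(1, n + 1):              # original loop: do I = 1, 99
    c_orig[i] = a[i] + b[i]

c_unrolled = [0.0] * (n + 1)
for i in range(1, n + 1, 3):           # unrolled loop: do I = 1, 99, 3
    c_unrolled[i]     = a[i]     + b[i]
    c_unrolled[i + 1] = a[i + 1] + b[i + 1]
    c_unrolled[i + 2] = a[i + 2] + b[i + 2]

print(c_orig == c_unrolled)  # True
```

Note the trip count (99) is divisible by the unroll factor (3); a real compiler must also emit cleanup code for leftover iterations when it is not.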
Subroutine Inlining
- Subroutine inlining replaces a call to a subroutine with the body of the subroutine itself.
- One reason for using subroutine inlining is that when a subroutine is called inside a do loop that has a huge iteration count, subroutine inlining may be more efficient because it cuts down on loop overhead.
- However, the chief reason for using it is that do loops that contain subroutine calls may not parallelize.
Subroutine Inlining
- On the SGI Origin2000 computer, there are several options to invoke inlining:
  - Inline all routines except those specified to -INLINE:never

        f90 -O3 -INLINE:all prog.f

  - Inline no routines except those specified to -INLINE:must

        f90 -O3 -INLINE:none prog.f

  - Specify a list of routines to inline at every call

        f90 -O3 -INLINE:must=subrname prog.f

  - Specify a list of routines never to inline

        f90 -O3 -INLINE:never=subrname prog.f

- On the Linux clusters, the following flags can invoke function inlining:
  - -ip: inline function expansion for calls defined within the current source file
  - -ipo: inline function expansion for calls defined in separate files
Optimization Report
- Intel 9.x and later compilers can generate reports that provide useful information on optimization done on different parts of your code.
- To generate such optimization reports in a file filename, add the flag -opt-report-file filename.
- If you have a lot of source files to process simultaneously, and you use a makefile to compile, you can also use make's "suffix" rules to have optimization reports produced automatically, each with a unique name. For example,

    .f.o:
            ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f

  creates optimization reports that are named identically to the original Fortran source but with the suffix ".f" replaced by ".opt".
Optimization Report
- To help developers and performance analysts navigate through the usually lengthy optimization reports, the NCSA program OptView is designed to provide an easy-to-use and intuitive interface that allows the user to browse through their own source code, cross-referenced with the optimization reports.
- OptView is installed on NCSA's IA-64 Linux cluster under the directory /usr/apps/tools/bin. You can either add that directory to your UNIX PATH or invoke optview using an absolute path name. You'll need to be using the X-Window system and to have set your DISPLAY environment variable correctly for OptView to work.
- OptView can provide a quick overview of which loops in a source code, or source codes among multiple files, are highly optimized and which might need further work. For a detailed description of the use of OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/
Profile-guided Optimization (PGO)
- Profile-guided optimization allows Intel compilers to use valuable runtime information to make better decisions about function inlining and interprocedural optimizations, to generate faster code. Its methodology is illustrated in a diagram (not reproduced here).
Vendor Tuned Code
- Vendor math libraries have codes that are optimized for their specific machine.
- On the SGI Origin2000 platform, Complib.sgimath and SCSL are available.
- On the Linux clusters, Intel MKL is available. Ways to link to these libraries are described in Section 3 - Porting Issues.
Further Information
- SGI IRIX man and www pages:
  - man opt
  - man lno
  - man inline
  - man ipa
  - man perfex
  - Performance Tuning for the Origin2000 at http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
- Linux clusters help and www pages:
  - ifort/icc/icpc -help (Intel)
  - http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
  - http://perfsuite.ncsa.uiuc.edu/OptView/
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
  5.1 Sequential Code Limitation
  5.2 Parallel Overhead
  5.3 Load Balance
    5.3.1 Loop Schedule Types
    5.3.2 Chunk Size
Parallel Code Tuning
- This chapter describes several of the most common techniques for parallel tuning, the type of programs that benefit, and the details for implementing them.
- The majority of this chapter deals with improving load balancing.
Sequential Code Limitation
- Sequential code is a part of the program that cannot be run with multiple processors. Some reasons why it cannot be made data parallel are:
  - The code is not in a do loop.
  - The do loop contains a read or write.
  - The do loop contains a dependency.
  - The do loop has an ambiguous subscript.
  - The do loop has a call to a subroutine or a reference to a function subprogram.
- Sequential Code Fraction:
  - As shown by Amdahl's Law, if the sequential fraction is too large, there is a limitation on speedup. If you think too much sequential code is a problem, you can calculate the sequential fraction of code using the Amdahl's Law formula.
Sequential Code Limitation
- Measuring the Sequential Code Fraction:
  - Decide how many processors to use; this is p.
  - Run and time the program with 1 processor to give T(1).
  - Run and time the program with p processors to give T(p).
  - Form a ratio of the two timings T(1)/T(p); this is SP.
  - Substitute SP and p into the Amdahl's Law formula:

        f = (1/SP - 1/p) / (1 - 1/p)

  - Solve for f; this is the fraction of sequential code.
- Decreasing the Sequential Code Fraction:
  - The compilation optimization reports list which loops could not be parallelized and why. You can use this report as a guide to improve performance on do loops by:
    - Removing dependencies
    - Removing I/O
    - Removing calls to subroutines and function subprograms
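The measurement recipe above is easy to script. The sketch below uses made-up timings (100 s serial, 40 s on 4 processors — assumptions for illustration) to compute the sequential fraction f:

```python
def sequential_fraction(t1, tp, p):
    """Amdahl's Law estimate of the sequential code fraction f,
    given T(1), T(p), and the processor count p."""
    sp = t1 / tp                                      # measured speedup SP = T(1)/T(p)
    return (1.0 / sp - 1.0 / p) / (1.0 - 1.0 / p)     # f = (1/SP - 1/p)/(1 - 1/p)

# Hypothetical timings: 100 s on 1 processor, 40 s on 4 processors.
f = sequential_fraction(100.0, 40.0, 4)
print(round(f, 3))  # 0.2  -- about 20% of the work ran sequentially
```

A speedup of only 2.5x on 4 processors thus already implies a 20% sequential fraction, which caps the achievable speedup at 1/0.2 = 5x no matter how many processors are added.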
Parallel Overhead
- Parallel overhead is the processing time spent:
  - Creating threads
  - Spin/blocking threads
  - Starting and ending parallel regions
  - Synchronizing at the end of parallel regions
- When the computational work done by the parallel processes is too small, the overhead time needed to create and control the parallel processes can be disproportionately large, limiting the savings due to parallelism.
- Measuring Parallel Overhead:
  - To get a rough under-estimate of parallel overhead:
    - Run and time the code using 1 processor.
    - Parallelize the code.
    - Run and time the parallel code using only 1 processor.
    - Subtract the two timings.
Parallel Overhead
- Reducing Parallel Overhead:
  - Don't parallelize all the loops.
  - Don't parallelize small loops. To benefit from parallelization, a loop needs about 1000 floating point operations or 500 statements in the loop. You can use the IF modifier in the OpenMP directive to control when loops are parallelized:

        !$OMP PARALLEL DO IF(n > 500)
        do i = 1, n
          ... body of loop ...
        end do
        !$OMP END PARALLEL DO

  - Use task parallelism instead of data parallelism. It doesn't generate as much parallel overhead, and often more code runs in parallel.
  - Don't use more threads than you need.
  - Parallelize at the highest level possible.
Load Balance
- Load balance is the even assignment of subtasks to processors so as to keep each processor busy doing useful work for as long as possible.
- Load balance is important for speedup because the end of a do loop is a synchronization point where threads need to catch up with each other.
- If processors have different work loads, some of the processors will idle while others are still working.
- Measuring Load Balance:
  - On the SGI Origin, to measure load balance, use the perfex tool, which is a command line interface to the R10000 hardware counters. The command

        perfex -e 16 -mp a.out > results

    reports per-thread cycle counts. Compare the cycle counts to determine load balance problems. The master thread (thread 0) always uses more cycles than the slave threads. If the counts are vastly different, it indicates load imbalance.
Load Balance
- For Linux systems, the thread CPU times can be compared with ps. A thread with unusually high or low time compared to the others may not be working efficiently (high CPU time could be the result of a thread spinning while waiting for other threads to catch up):

    ps uH

- Improving Load Balance:
  - To improve load balance, try changing the way that loop iterations are allocated to threads by:
    - Changing the loop schedule type
    - Changing the chunk size
  - These methods are discussed in the following sections.
Loop Schedule Types
- On the SGI Origin2000 computer, 4 different loop schedule types can be specified by an OpenMP directive:
  - Static
  - Dynamic
  - Guided
  - Runtime
- If you don't specify a schedule type, the default will be used.
- Default Schedule Type:
  - The default schedule type allocates 20 iterations on 4 threads as shown in a diagram (not reproduced here).
Loop Schedule Types
- Static Schedule Type:
  - The static schedule type is used when some of the iterations do more work than others. With the static schedule type, iterations are allocated in a round-robin fashion to the threads.
- An Example:
  - Suppose you are computing on the upper triangle of a 100 x 100 matrix, and you use 2 threads, named t0 and t1. With default scheduling, workloads are uneven.
Loop Schedule Types
- Whereas with static scheduling, the columns of the matrix are given to the threads in a round-robin fashion, resulting in better load balance.
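The effect can be quantified with a short sketch. It models column j of the upper triangle as costing j units of work — an assumption for illustration, not something stated on the slides:

```python
# Compare per-thread work for block (default-style) vs. round-robin (static)
# allocation of the 100 columns of an upper-triangular computation,
# where column j (1-based) costs j units of work.
n, threads = 100, 2

# Block allocation: thread 0 gets columns 1..50, thread 1 gets columns 51..100.
block = [sum(range(1, n // 2 + 1)), sum(range(n // 2 + 1, n + 1))]

# Round-robin allocation: columns are dealt out to the threads alternately.
rr = [0] * threads
for j in range(1, n + 1):
    rr[(j - 1) % threads] += j

print(block)  # [1275, 3775]  -- badly imbalanced: thread 1 does ~3x the work
print(rr)     # [2500, 2550]  -- nearly even
```

Since the loop's end is a synchronization point, the block schedule's finish time is set by the 3775-unit thread, while round-robin finishes in about 2550 units.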
Loop Schedule Types
- Dynamic Schedule Type:
  - The iterations are dynamically allocated to threads at runtime. Each thread is given a chunk of iterations. When a thread finishes its work, it goes into a critical section where it's given another chunk of iterations to work on.
  - This type is useful when you don't know the iteration count or work pattern ahead of time. Dynamic gives good load balance, but at a high overhead cost.
- Guided Schedule Type:
  - The guided schedule type is dynamic scheduling that starts with large chunks of iterations and ends with small chunks of iterations. That is, the number of iterations given to each thread depends on the number of iterations remaining. The guided schedule type reduces the number of entries into the critical section, compared to the dynamic schedule type. Guided gives good load balancing at a low overhead cost.
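One common rule for how guided scheduling shrinks its chunks — an assumption here, since implementations vary and the slides do not give a formula — is chunk = ceil(remaining / number_of_threads). A sketch of the resulting hand-out sequence:

```python
import math

def guided_chunks(iterations, threads):
    """Chunk sizes handed out by a guided schedule, assuming the common
    rule chunk = ceil(remaining / threads); real runtimes may differ."""
    chunks, remaining = [], iterations
    while remaining > 0:
        chunk = math.ceil(remaining / threads)   # large early, small late
        chunks.append(chunk)
        remaining -= chunk
    return chunks

print(guided_chunks(20, 4))  # [5, 4, 3, 2, 2, 1, 1, 1, 1]
```

Nine hand-outs (critical-section entries) cover 20 iterations, versus 20 hand-outs for dynamic scheduling with the default chunk size of 1 — which is why guided has lower overhead while still load-balancing the tail.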
Chunk Size
- The word chunk refers to a grouping of iterations. Chunk size means how many iterations are in the grouping. The static and dynamic schedule types can be used with a chunk size. If a chunk size is not specified, then the chunk size is 1.
- Suppose you specify a chunk size of 2 with the static schedule type. Then the 20 iterations are allocated on 4 threads as chunks of 2 iterations, dealt out to the threads in round-robin order.
- The schedule type and chunk size are specified as follows:

  !$OMP PARALLEL DO SCHEDULE(type, chunk)
  ...
  !$OMP END PARALLEL DO

- where type is STATIC, DYNAMIC, or GUIDED, and chunk is any positive integer.
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information
Timing and Profiling
- Now that your program has been ported to the new computer, you will want to know how fast it runs.
- This chapter describes how to measure the speed of a program using various timing routines.
- The chapter also covers how to determine which parts of the program account for the bulk of the computational load, so that you can concentrate your tuning efforts on those computationally intensive parts of the program.
Timing
- In the following sections, we'll discuss timers and review the profiling tools ssrun and prof on the Origin, and vprof and gprof on the Linux clusters. The specific timing functions described are:
- Timing a section of code
  - FORTRAN
    - etime, dtime, cpu_time for CPU time
    - time and f_time for wall clock time
  - C
    - clock for CPU time
    - gettimeofday for wall clock time
- Timing an executable
  - time a.out
- Timing a batch run
  - busage
  - qstat
  - qhist
CPU Time
etime
- A section of code can be timed using etime.
- It returns the elapsed CPU time in seconds since the program started.

  real*4 tarray(2), time1, time2, timeres
  ... beginning of program
  time1 = etime(tarray)
  ... start of section of code to be timed
  ... lots of computation
  ... end of section of code to be timed
  time2 = etime(tarray)
  timeres = time2 - time1
CPU Time
dtime
- A section of code can also be timed using dtime.
- It returns the elapsed CPU time in seconds since the last call to dtime.

  real*4 tarray(2), timeres
  ... beginning of program
  timeres = dtime(tarray)
  ... start of section of code to be timed
  ... lots of computation
  ... end of section of code to be timed
  timeres = dtime(tarray)
  ... rest of program
CPU Time
The etime and dtime Functions
- User time.
  - This is returned as the first element of tarray.
  - It's the CPU time spent executing user code.
- System time.
  - This is returned as the second element of tarray.
  - It's the time spent executing system calls on behalf of your program.
- Sum of user and system time.
  - This is the function value that is returned.
  - It's the time that is usually reported.
- Metric.
  - Timings are reported in seconds.
  - Timings are accurate to 1/100th of a second.
CPU Time
cpu_time
- The cpu_time routine is available only on the Linux clusters, as it is a component of the Intel FORTRAN compiler library.
- It provides substantially higher resolution and has substantially lower overhead than the older etime and dtime routines.
- It can be used as an elapsed timer.

  real*8 time1, time2, timeres
  ... beginning of program
  call cpu_time(time1)
  ... start of section of code to be timed
  ... lots of computation
  ... end of section of code to be timed
  call cpu_time(time2)
  timeres = time2 - time1
  ... rest of program
CPU Time
clock
- For C programmers, one can call the cpu_time routine using a FORTRAN wrapper, or call the intrinsic function clock to determine elapsed CPU time.

  #include <time.h>
  static const double iCPS = 1.0 / (double)CLOCKS_PER_SEC;
  double time1, time2, timeres;
  time1 = (clock() * iCPS);
  /* do some work */
  time2 = (clock() * iCPS);
  timeres = time2 - time1;
Wall clock Time
time
- For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970.
- It is a means of getting the elapsed wall clock time.
- The wall clock time is reported in integer seconds.

  external time
  integer*4 time1, time2, timeres
  ... beginning of program
  time1 = time()
  ... start of section of code to be timed
  ... lots of computation
  ... end of section of code to be timed
  time2 = time()
  timeres = time2 - time1
Wall clock Time
f_time
- For the Linux clusters, the appropriate FORTRAN function for elapsed time is f_time.

  integer*8 f_time
  external f_time
  integer*8 time1, time2, timeres
  ... beginning of program
  time1 = f_time()
  ... start of section of code to be timed
  ... lots of computation
  ... end of section of code to be timed
  time2 = f_time()
  timeres = time2 - time1

- As above for etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library, include the compiler flag -Vaxlib.
Wall clock Time
gettimeofday
- For C programmers, wall clock time can be obtained by using the very portable routine gettimeofday.

  #include <stddef.h>   /* definition of NULL */
  #include <sys/time.h> /* definition of timeval struct and
                           prototyping of gettimeofday */
  double t1, t2, elapsed;
  struct timeval tp;
  int rtn;
  ...
  rtn = gettimeofday(&tp, NULL);
  t1 = (double)tp.tv_sec + (1.e-6) * tp.tv_usec;
  ...
  /* do some work */
  ...
  rtn = gettimeofday(&tp, NULL);
  t2 = (double)tp.tv_sec + (1.e-6) * tp.tv_usec;
  elapsed = t2 - t1;
Timing an Executable
- To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time):

  time options a.out

- where options can be -p for a simple output, or -f format, which allows the user to display more than just time-related information.
- Consult the man pages on the time command for format options.
Timing a Batch Job
- Time of a batch job running or completed.
  - Origin

    busage jobid

  - Linux clusters

    qstat jobid   # for a running job
    qhist jobid   # for a completed job
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information
Profiling
- Profiling determines where a program spends its time.
- It detects the computationally intensive parts of the code.
- Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load.
- Most codes follow the 90-10 Rule.
  - That is, 90% of the computation is done in 10% of the code.
Profiling Tools
Profiling Tools on the Origin
- On the SGI Origin2000 computer there are profiling tools named ssrun and prof.
  - Used together they do profiling, or what is called hot spot analysis.
  - They are useful for generating timing profiles.
- ssrun
  - The ssrun utility collects performance data for an executable that you specify.
  - The performance data is written to a file named "executablename.exptype.id".
- prof
  - The prof utility analyzes the data file created by ssrun and produces a report.
- Example

  ssrun -fpcsamp a.out
  prof -h a.out.fpcsamp.m12345 > prof.list
Profiling Tools
Profiling Tools on the Linux Clusters
- On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun, prof and perfex tools.
- gprof
  - Basic profiling information can be generated using the OS utility gprof.
  - First, compile the code with the compiler flags -qp -g for the Intel compiler (-g on the Intel compiler does not change the optimization level), or -pg for the GNU compiler.
  - Second, run the program.
  - Finally, analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out.

  efc -O -qp -g -o foo foo.f
  ./foo
  gprof foo gmon.out
Profiling Tools
Profiling Tools on the Linux Clusters
- vprof
  - On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library.
  - To instrument the whole application requires recompiling and linking to the vprof and PAPI libraries.

  setenv VMON PAPI_TOT_CYC
  ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
  ./md
  /usr/apps/tools/vprof/bin/cprof -e md vmon.out
Profile Listings
Profile Listings on the Origin
- prof Output First Listing
  - The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.

    Cycles     %    Cum%   Secs  Proc
  --------  -----  -----  -----  ------
  42630984  58.47  58.47   0.57  VSUB
   6498294   8.91  67.38   0.09  PFSOR
   6141611   8.42  75.81   0.08  PBSOR
   3654120   5.01  80.82   0.05  PFSOR1
   2615860   3.59  84.41   0.03  VADD
   1580424   2.17  86.57   0.02  ITSRCG
   1144036   1.57  88.14   0.02  ITSRSI
    886044   1.22  89.36   0.01  ITJSI
    861136   1.18  90.54   0.01  ITJCG
Profile Listings
Profile Listings on the Origin
- prof Output Second Listing
  - The second listing gives the number of cycles per source code line.
  - The lines are listed in descending order of cycle count.

    Cycles     %    Cum%  Line  Proc
  --------  -----  -----  ----  ------
  36556944  50.14  50.14  8106  VSUB
   5313198   7.29  57.43  6974  PFSOR
   4968804   6.81  64.24  6671  PBSOR
   2989882   4.10  68.34  8107  VSUB
   2564544   3.52  71.86  7097  PFSOR1
   1988420   2.73  74.59  8103  VSUB
   1629776   2.24  76.82  8045  VADD
    994210   1.36  78.19  8108  VSUB
    969056   1.33  79.52  8049  VADD
    483018   0.66  80.18  6972  PFSOR
Profile Listings
Profile Listings on the Linux Clusters
- gprof Output First Listing
  - The listing gives a 'flat' profile of the functions and routines encountered, sorted by 'self seconds', which is the number of seconds accounted for by this function alone.

  Flat profile:

  Each sample counts as 0.000976562 seconds.
    %   cumulative   self                self     total
   time   seconds   seconds     calls  us/call   us/call  name
  38.07      5.67      5.67       101 56157.18 107450.88  compute_
  34.72     10.84      5.17  25199500     0.21      0.21  dist_
  25.48     14.64      3.80                               SIND_SINCOS
   1.25     14.83      0.19                               sin
   0.37     14.88      0.06                               cos
   0.05     14.89      0.01     50500     0.15      0.15  dotr8_
   0.05     14.90      0.01       100    68.36     68.36  update_
   0.01     14.90      0.00                               f_fioinit
   0.01     14.90      0.00                               f_intorange
   0.01     14.90      0.00                               mov
   0.00     14.90      0.00         1     0.00      0.00  initialize_
Profile Listings
Profile Listings on the Linux Clusters
- gprof Output Second Listing
  - The second listing gives a 'call-graph' profile of the functions and routines encountered. The definitions of the columns are specific to the line in question. Detailed information is contained in the full output from gprof.

  Call graph:

  index  % time   self  children     called         name
  -----  ------  -----  --------  ---------------  ----------------
  [1]      72.9   0.00     10.86                   main [1]
                  5.67      5.18       101/101         compute_ [2]
                  0.01      0.00       100/100         update_ [8]
                  0.00      0.00         1/1           initialize_ [12]
  ---------------------------------------------------------------------
                  5.67      5.18       101/101         main [1]
  [2]      72.8   5.67      5.18       101           compute_ [2]
                  5.17      0.00  25199500/25199500    dist_ [3]
                  0.01      0.00     50500/50500       dotr8_ [7]
  ---------------------------------------------------------------------
                  5.17      0.00  25199500/25199500    compute_ [2]
  [3]      34.7   5.17      0.00  25199500           dist_ [3]
  ---------------------------------------------------------------------
  [4]      25.5   3.80      0.00                    SIND_SINCOS [4]
Profile Listings
Profile Listings on the Linux Clusters
- vprof Listing

  Columns correspond to the following events:
    PAPI_TOT_CYC - Total cycles (1956 events)

  File Summary:
   100.0% /u/ncsa/gbauer/temp/md.f

  Function Summary:
    84.4% compute
    15.6% dist

  Line Summary:
    67.3% /u/ncsa/gbauer/temp/md.f:106
    13.6% /u/ncsa/gbauer/temp/md.f:104
     9.3% /u/ncsa/gbauer/temp/md.f:166
     2.5% /u/ncsa/gbauer/temp/md.f:165
     1.5% /u/ncsa/gbauer/temp/md.f:102
     1.2% /u/ncsa/gbauer/temp/md.f:164
     0.9% /u/ncsa/gbauer/temp/md.f:107
     0.8% /u/ncsa/gbauer/temp/md.f:169
     0.8% /u/ncsa/gbauer/temp/md.f:162
     0.8% /u/ncsa/gbauer/temp/md.f:105

- The above listing (using the -e option to cprof) displays not only the cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions.
Profiling Analysis
- The program being analyzed in the previous Origin example has approximately 10000 source code lines, and consists of many subroutines.
- The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine.
- The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation.
- Going back to the source code, line 8106 is a line inside a do loop.
- By putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part.
- Since the compiler has rearranged the source lines, the line numbers given by ssrun/prof give you an area of the code to inspect.
- To view the rearranged source, use the options:

  f90 -FLIST:=ON
  cc -CLIST:=ON

- For the Intel compilers, the appropriate options are:

  ifort -E
  icc -E
Further Information
- SGI Irix
  - man etime
  - man 3 time
  - man 1 time
  - man busage
  - man timers
  - man ssrun
  - man prof
  - Origin2000 Performance Tuning and Optimization Guide
- Linux Clusters
  - man 3 clock
  - man 2 gettimeofday
  - man 1 time
  - man 1 gprof
  - man 1B qstat
  - Intel Compilers; Vprof on NCSA Linux Cluster
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
Agenda
7 Cache Tuning
  7.1 Cache Concepts
    7.1.1 Memory Hierarchy
    7.1.2 Cache Mapping
    7.1.3 Cache Thrashing
    7.1.4 Cache Coherence
  7.2 Cache Specifics
  7.3 Code Optimization
  7.4 Measuring Cache Performance
  7.5 Locating the Cache Problem
  7.6 Cache Tuning Strategy
  7.7 Preserve Spatial Locality
  7.8 Locality Problem
  7.9 Grouping Data Together
  7.10 Cache Thrashing Example
  7.11 Not Enough Cache
  7.12 Loop Blocking
  7.13 Further Information
Cache Concepts
- The CPU time required to perform an operation is the sum of the clock cycles executing instructions and the clock cycles waiting for memory.
- The CPU cannot be performing useful work if it is waiting for data to arrive from memory.
- Clearly then, the memory system is a major factor in determining the performance of your program, and a large part of that is your use of the cache.
- The following sections will discuss the key concepts of cache, including:
  - Memory subsystem hierarchy
  - Cache mapping
  - Cache thrashing
  - Cache coherence
Memory Hierarchy
- The different subsystems in the memory hierarchy have different speeds, sizes, and costs.
  - Smaller memory is faster.
  - Slower memory is cheaper.
- The