01-Parallel Computing Explained



    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
    3 Porting Issues
    4 Scalar Tuning
    5 Parallel Code Tuning
    6 Timing and Profiling
    7 Cache Tuning
    8 Parallel Performance Analysis
    9 About the IBM Regatta P690


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
        1.1.1 Parallelism in our Daily Lives
        1.1.2 Parallelism in Computer Programs
        1.1.3 Parallelism in Computers
          1.1.3.4 Disk Parallelism
        1.1.4 Performance Measures
        1.1.5 More Parallelism Issues
      1.2 Comparison of Parallel Computers
      1.3 Summary


    Parallel Computing Overview
    Who should read this chapter?
    • New users, to learn concepts and terminology.
    • Intermediate users, for review or reference.
    • Management staff, to understand the basic concepts even if you don't plan to do any programming.
    • Note: Advanced users may opt to skip this chapter.


    Introduction to Parallel Computing
    • High performance parallel computers
      • can solve large problems much faster than a desktop computer
      • have fast CPUs, large memory, high speed interconnects, and high speed input/output
      • are able to speed up computations
        • by making the sequential components run faster
        • by doing more operations in parallel
    • High performance parallel computers are in demand
      • There is a need for tremendous computational capabilities in science, engineering, and business.
      • Applications require gigabytes/terabytes of memory and gigaflops/teraflops of performance.
      • Scientists are striving for petascale performance.


    Introduction to Parallel Computing
    • High performance parallel computers are used in a wide variety of disciplines:
      • Meteorologists: prediction of tornadoes and thunderstorms
      • Computational biologists: analysis of DNA sequences
      • Pharmaceutical companies: design of new drugs
      • Oil companies: seismic exploration
      • Wall Street: analysis of financial markets
      • NASA: aerospace vehicle design
      • Entertainment industry: special effects in movies and commercials
    • These complex scientific and business applications all need to perform computations on large datasets or large equations.


    Parallelism in our Daily Lives
    • There are two types of processes that occur in computers and in our daily lives:
    • Sequential processes
      • occur in a strict order
      • it is not possible to do the next step until the current one is completed
      • Examples
        • The passage of time: the sun rises and the sun sets.
        • Writing a term paper: pick the topic, research, and write the paper.
    • Parallel processes
      • many events happen simultaneously
      • Examples
        • Plant growth in the springtime
        • An orchestra


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
        1.1.1 Parallelism in our Daily Lives
        1.1.2 Parallelism in Computer Programs
          1.1.2.1 Data Parallelism
          1.1.2.2 Task Parallelism
        1.1.3 Parallelism in Computers
          1.1.3.4 Disk Parallelism
        1.1.4 Performance Measures
        1.1.5 More Parallelism Issues
      1.2 Comparison of Parallel Computers
      1.3 Summary


    Parallelism in Computer Programs
    • Conventional wisdom:
      • Computer programs are sequential in nature.
      • Only a small subset of them lend themselves to parallelism.
      • Algorithm: the "sequence of steps" necessary to do a computation.
      • For the first 30 years of computer use, programs were run sequentially.
    • The 1980s saw great successes with parallel computers.
      • Dr. Geoffrey Fox published a book entitled Parallel Computing Works!, describing many scientific accomplishments resulting from parallel computing.
      • Computer programs are parallel in nature.
      • Only a small subset of them need to be run sequentially.


    Parallel Computing
    • Parallel computing is what a computer does when it carries out more than one computation at a time using more than one processor.
    • By using many processors at once, we can speed up the execution.
      • If one processor can perform the arithmetic in time t, then ideally p processors can perform the arithmetic in time t/p.
      • What if I use 100 processors? What if I use 1000 processors? (See the worked example after this list.)
    • Almost every program has some form of parallelism.
      • You need to determine whether your data or your program can be partitioned into independent pieces that can be run simultaneously.
      • Decomposition is the name given to this partitioning process.
    • Types of parallelism:
      • data parallelism
      • task parallelism


    Data Parallelism
    • The same code segment runs concurrently on each processor, but each processor is assigned its own part of the data to work on.
      • Do loops (in Fortran) define the parallelism.
      • The iterations must be independent of each other.
    • Data parallelism is called "fine grain parallelism" because the computational work is spread into many small subtasks.
    • Example
      • Dense linear algebra, such as matrix multiplication, is a perfect candidate for data parallelism.


    An Example of Data Parallelism

    Original Sequential Code:

      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO

    Parallel Code:

      !$OMP PARALLEL DO
      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO
      !$OMP END PARALLEL DO


    Quick Intro to OpenMP
    • OpenMP is a portable standard for parallel directives covering both data and task parallelism.
    • More information about OpenMP is available on the OpenMP website.
    • We will have a lecture on Introduction to OpenMP later.
    • With OpenMP, the loop that is performed in parallel is the loop that immediately follows the PARALLEL DO directive.
      • In our sample code, it's the K loop: DO K=1,N


    OpenMP Loop Parallelism: Iteration-Processor Assignments

    The code segment running on each processor:

      DO J=1,N
        DO I=1,N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
        END DO
      END DO

    Processor   Iterations of K   Data Elements
    proc0       K=1:5             A(I,1:5),   B(1:5,J)
    proc1       K=6:10            A(I,6:10),  B(6:10,J)
    proc2       K=11:15           A(I,11:15), B(11:15,J)
    proc3       K=16:20           A(I,16:20), B(16:20,J)


    OpenMP Style of Parallelism
    • Parallelization can be done incrementally, as follows:
      1. Parallelize the most computationally intensive loop.
      2. Compute the performance of the code.
      3. If performance is not satisfactory, parallelize another loop.
      4. Repeat steps 2 and 3 as many times as needed.
    • The ability to perform incremental parallelism is considered a positive feature of data parallelism.
    • It is contrasted with the MPI (Message Passing Interface) style of parallelism, which is an "all or nothing" approach.


    Task Parallelism
    • Task parallelism may be thought of as the opposite of data parallelism.
      • Instead of the same operations being performed on different parts of the data, each process performs different operations.
    • You can use task parallelism when your program can be split into independent pieces, often subroutines, that can be assigned to different processors and run concurrently.
    • Task parallelism is called "coarse grain" parallelism because the computational work is spread into just a few subtasks.
    • More code is run in parallel because the parallelism is implemented at a higher level than in data parallelism.
    • Task parallelism is often easier to implement and has less overhead than data parallelism.


    Task Parallelism
    • The abstract code shown in the diagram is decomposed into 4 independent code segments, labeled A, B, C, and D. The right hand side of the diagram illustrates the 4 code segments running concurrently.


    Task Parallelism

    Original Code:

      program main
        code segment labeled A
        code segment labeled B
        code segment labeled C
        code segment labeled D
      end

    Parallel Code:

      program main
      !$OMP PARALLEL
      !$OMP SECTIONS
        code segment labeled A
      !$OMP SECTION
        code segment labeled B
      !$OMP SECTION
        code segment labeled C
      !$OMP SECTION
        code segment labeled D
      !$OMP END SECTIONS
      !$OMP END PARALLEL
      end
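
    For readers who want something they can actually compile, here is a minimal, self-contained sketch of the same SECTIONS structure; the subroutine names part_a through part_d are hypothetical stand-ins for the four code segments, not part of the original slides.

      ! Build with an OpenMP flag (compiler dependent, e.g. -mp or -fopenmp).
      program tasks
        implicit none
      !$OMP PARALLEL
      !$OMP SECTIONS
      !$OMP SECTION
        call part_a()      ! code segment labeled A
      !$OMP SECTION
        call part_b()      ! code segment labeled B
      !$OMP SECTION
        call part_c()      ! code segment labeled C
      !$OMP SECTION
        call part_d()      ! code segment labeled D
      !$OMP END SECTIONS
      !$OMP END PARALLEL
      contains
        subroutine part_a()
          print *, 'part A'
        end subroutine part_a
        subroutine part_b()
          print *, 'part B'
        end subroutine part_b
        subroutine part_c()
          print *, 'part C'
        end subroutine part_c
        subroutine part_d()
          print *, 'part D'
        end subroutine part_d
      end program tasks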


    OpenMP Task Parallelism
    • With OpenMP, the code that follows each SECTION(S) directive is allocated to a different processor. In our sample parallel code, the allocation of code segments to processors is as follows:

    Processor   Code
    proc0       code segment labeled A
    proc1       code segment labeled B
    proc2       code segment labeled C
    proc3       code segment labeled D


    Parallelism in Computers
    • How parallelism is exploited and enhanced within the operating system and hardware components of a parallel computer:
      • operating system
      • arithmetic
      • memory
      • disk


    Operating System Parallelism
    • All of the commonly used parallel computers run a version of the Unix operating system. In the table below each OS listed is in fact Unix, but the name of the Unix OS varies with each vendor.
    • For more information about Unix, a collection of Unix documents is available.

    Parallel Computer      OS
    SGI Origin2000         IRIX
    HP V-Class             HP-UX
    Cray T3E               Unicos
    IBM SP                 AIX
    Workstation Clusters   Linux


    Two Unix Parallelism Features
    • Background processing facility
      • With the Unix background processing facility you can run the executable a.out in the background and simultaneously view the man page for the etime function in the foreground. There are two Unix commands that accomplish this:

        a.out > results &
        man etime

    • cron feature
      • With the Unix cron feature you can submit a job that will run at a later time.


    Arithmetic Parallelism
    • Multiple execution units
      • facilitate arithmetic parallelism.
      • The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are each done in a separate execution unit. This allows several execution units to be used simultaneously, because the execution units operate independently.
    • Fused multiply and add
      • is another parallel arithmetic feature.
      • Parallel computers are able to overlap multiply and add. This arithmetic is named Multiply-ADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on HP computers. In either case, the two arithmetic operations are overlapped and can complete in hardware in one computer cycle. (A small sketch follows this list.)
    • Superscalar arithmetic
      • is the ability to issue several arithmetic operations per computer cycle.
      • It makes use of the multiple, independent execution units. On superscalar computers there are multiple slots per cycle that can be filled with work. This gives rise to the name n-way superscalar, where n is the number of slots per cycle. The SGI Origin2000 is called a 4-way superscalar computer.
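
    As a small illustration (mine, not the slide's): a loop whose body is a multiply followed by an add is the pattern that MADD/FMA hardware targets. Whether the compiler actually fuses the two operations depends on the compiler and its flags.

      ! Each iteration is one multiply plus one add, which FMA-capable
      ! hardware can execute as a single fused multiply-add.
      subroutine axpy(n, a, x, y)
        implicit none
        integer, intent(in) :: n
        real, intent(in)    :: a, x(n)
        real, intent(inout) :: y(n)
        integer :: i
        do i = 1, n
          y(i) = a * x(i) + y(i)
        end do
      end subroutine axpy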


    Memory Parallelism
    • Memory interleaving
      • Memory is divided into multiple banks, and consecutive data elements are interleaved among them. For example, if your computer has 2 memory banks, then data elements with even memory addresses would fall into one bank, and data elements with odd memory addresses into the other. (A small access-pattern sketch follows this list.)
    • Multiple memory ports
      • A port is a bi-directional memory pathway. When the data elements that are interleaved across the memory banks are needed, the multiple memory ports allow them to be accessed and fetched in parallel, which increases the memory bandwidth (MB/s or GB/s).
    • Multiple levels of the memory hierarchy
      • There is global memory that any processor can access. There is memory that is local to a partition of the processors. Finally, there is memory that is local to a single processor, that is, the cache memory and the memory elements held in registers.
    • Cache memory
      • Cache is a small memory that has fast access compared with the larger main memory and serves to keep the faster processor filled with data.
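
    A small sketch (my own, not from the slides) of the access pattern that interleaved banks and cache lines reward: in Fortran's column-major storage, making the innermost loop run over the first index touches consecutive memory locations.

      subroutine scale_matrix(n, a, s)
        implicit none
        integer, intent(in) :: n
        real, intent(inout) :: a(n, n)
        real, intent(in)    :: s
        integer :: i, j
        do j = 1, n        ! outer loop over columns
          do i = 1, n      ! inner loop is unit stride in column-major storage
            a(i, j) = s * a(i, j)
          end do
        end do
      end subroutine scale_matrix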


    Memory Parallelism
    [Figures: Memory Hierarchy; Cache Memory]


    Disk Parallelism
    • RAID (Redundant Array of Inexpensive Disks)
      • RAID disks are on most parallel computers.
      • The advantage of a RAID disk system is that it provides a measure of fault tolerance.
      • If one of the disks goes down, it can be swapped out, and the RAID disk system remains operational.
    • Disk striping
      • When a data set is written to disk, it is striped across the RAID disk system. That is, it is broken into pieces that are written simultaneously to the different disks in the RAID disk system. When the same data set is read back in, the pieces are read in parallel, and the full data set is reassembled in memory.


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
        1.1.1 Parallelism in our Daily Lives
        1.1.2 Parallelism in Computer Programs
        1.1.3 Parallelism in Computers
          1.1.3.4 Disk Parallelism
        1.1.4 Performance Measures
        1.1.5 More Parallelism Issues
      1.2 Comparison of Parallel Computers
      1.3 Summary


    Performance Measures
    • Peak performance
      • is the top speed at which the computer can operate.
      • It is a theoretical upper limit on the computer's performance.
    • Sustained performance
      • is the highest consistently achieved speed.
      • It is a more realistic measure of computer performance.
    • Cost performance
      • is used to determine if the computer is cost effective.
    • MHz
      • is a measure of the processor speed.
      • The processor speed is commonly measured in millions of cycles per second, where a computer cycle is defined as the shortest time in which some work can be done.
    • MIPS
      • is a measure of how quickly the computer can issue instructions.
      • Millions of instructions per second is abbreviated as MIPS, where the instructions are computer instructions such as memory reads and writes, logical operations, floating point operations, integer operations, and branch instructions.


    Performance Measures
    • Mflops (millions of floating point operations per second)
      • measures how quickly a computer can perform floating-point operations such as add, subtract, multiply, and divide.
    • Speedup
      • measures the benefit of parallelism.
      • It shows how your program scales as you compute with more processors, compared to the performance on one processor.
      • Ideal speedup happens when the performance gain is linearly proportional to the number of processors used.
    • Benchmarks
      • are used to rate the performance of parallel computers and parallel programs.
      • A well known benchmark that is used to compare parallel computers is the Linpack benchmark.
      • Based on the Linpack results, a list is produced of the Top 500 Supercomputer Sites. This list is maintained by the University of Tennessee and the University of Mannheim.


    More Parallelism Issues
    • Load balancing
      • is the technique of evenly dividing the workload among the processors.
      • For data parallelism it involves how iterations of loops are allocated to processors.
      • Load balancing is important because the total time for the program to complete is the time spent by the longest executing thread.
    • The problem size
      • must be large and must be able to grow as you compute with more processors.
      • In order to get the performance you expect from a parallel computer you need to run a large application with large data sizes, otherwise the overhead of passing information between processors will dominate the calculation time.
    • Good software tools
      • are essential for users of high performance parallel computers.
      • These tools include:
        • parallel compilers
        • parallel debuggers
        • performance analysis tools
        • parallel math software
      • The availability of a broad set of application software is also important.


    More Parallelism Issues
    • The high performance computing market is risky and chaotic. Many supercomputer vendors are no longer in business, making the portability of your application very important.
    • A workstation farm
      • is defined as a fast network connecting heterogeneous workstations.
      • The individual workstations serve as desktop systems for their owners.
      • When they are idle, large problems can take advantage of the unused cycles in the whole system.
      • An application of this concept is the SETI project. You can participate in searching for extraterrestrial intelligence with your home PC. More information about this project is available at the SETI Institute.
    • Condor
      • is software that provides resource management services for applications that run on heterogeneous collections of workstations.
      • Miron Livny at the University of Wisconsin at Madison is the director of the Condor project, and has coined the phrase "high throughput computing" to describe this process of harnessing idle workstation cycles. More information is available at the Condor Home Page.


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
      1.2 Comparison of Parallel Computers
        1.2.1 Processors
        1.2.2 Memory Organization
        1.2.3 Flow of Control
        1.2.4 Interconnection Networks
          1.2.4.1 Bus Network
          1.2.4.2 Cross-Bar Switch Network
          1.2.4.3 Hypercube Network
          1.2.4.4 Tree Network
          1.2.4.5 Interconnection Networks Self-test
        1.2.5 Summary of Parallel Computer Characteristics
      1.3 Summary


    Comparison of Parallel Computers
    • Now you can explore the hardware components of parallel computers:
      • kinds of processors
      • types of memory organization
      • flow of control
      • interconnection networks
    • You will see what is common to these parallel computers, and what makes each one of them unique.


    Kinds of Processors
    • There are three types of parallel computers:
      1. Computers with a small number of powerful processors
        • Typically have tens of processors.
        • The cooling of these computers often requires very sophisticated and expensive equipment, making these computers very expensive for computing centers.
        • They are general-purpose computers that perform especially well on applications that have large vector lengths.
        • Examples of this type of computer are the Cray SV1 and the Fujitsu VPP5000.


    Kinds of Processors
    • There are three types of parallel computers:
      2. Computers with a large number of less powerful processors
        • Named a Massively Parallel Processor (MPP), these typically have thousands of processors.
        • The processors are usually proprietary and air-cooled.
        • Because of the large number of processors, the distance between the furthest processors can be quite large, requiring a sophisticated internal network that allows distant processors to communicate with each other quickly.
        • These computers are suitable for applications with a high degree of concurrency.
        • The MPP type of computer was popular in the 1980s.
        • Examples of this type of computer were the Thinking Machines CM-2 computer, and the computers made by the MassPar company.


    Kinds of Processors
    • There are three types of parallel computers:
      3. Computers that are medium scale, in between the two extremes
        • Typically have hundreds of processors.
        • The processor chips are usually not proprietary; rather, they are commodity processors like the Pentium III.
        • These are general-purpose computers that perform well on a wide range of applications.
        • The most common example of this class is the Linux Cluster.


    Trends and Examples
    • Processor trends:

    Decade   Processor Type                    Computer Example
    1970s    Pipelined, Proprietary            Cray-1
    1980s    Massively Parallel, Proprietary   Thinking Machines CM2
    1990s    Superscalar, RISC, Commodity      SGI Origin2000
    2000s    CISC, Commodity                   Workstation Clusters

    • The processors on today's commonly used parallel computers:

    Computer               Processor
    SGI Origin2000         MIPS RISC R12000
    HP V-Class             HP PA 8200
    Cray T3E               Compaq Alpha
    IBM SP                 IBM Power3
    Workstation Clusters   Intel Pentium III, Intel Itanium


    Memory Organization
    • The following paragraphs describe the three types of memory organization found on parallel computers:
      • distributed memory
      • shared memory
      • distributed shared memory


    Distributed Memory
    • In distributed memory computers, the total memory is partitioned into memory that is private to each processor.
    • There is a Non-Uniform Memory Access time (NUMA), which is proportional to the distance between the two communicating processors.
    • On NUMA computers, data is accessed the quickest from a private memory, while data from the most distant processor takes the longest to access.
    • Some examples are the Cray T3E, the IBM SP, and workstation clusters.


    Distributed Memory
    • When programming distributed memory computers, the code and the data should be structured such that the bulk of a processor's data accesses are to its own private (local) memory.
      • This is called having good data locality.
    • Today's distributed memory computers use message passing, such as MPI, to communicate between processors.


    Distributed Memory
    • One advantage of distributed memory computers is that they are easy to scale. As the demand for resources grows, computer centers can easily add more memory and processors.
      • This is often called the LEGO block approach.
    • The drawback is that programming of distributed memory computers can be quite complicated.


    Shared Memory
    • In shared memory computers, all processors have access to a single pool of centralized memory with a uniform address space.
    • Any processor can address any memory location at the same speed, so there is Uniform Memory Access time (UMA).
    • Processors communicate with each other through the shared memory.
    • The advantages and disadvantages of shared memory machines are roughly the opposite of distributed memory computers.
      • They are easier to program because they resemble the programming of single processor machines.
      • But they don't scale like their distributed memory counterparts.


    Distributed Shared Memory
    • In Distributed Shared Memory (DSM) computers, a cluster or partition of processors has access to a common shared memory.
    • It accesses the memory of a different processor cluster in a NUMA fashion.
    • Memory is physically distributed but logically shared.
    • Attention to data locality is again important.
    • Distributed shared memory computers combine the best features of both distributed memory computers and shared memory computers.
      • That is, DSM computers have both the scalability of distributed memory computers and the ease of programming of shared memory computers.
    • Some examples of DSM computers are the SGI Origin2000 and the HP V-Class computers.


    Trends and Examples
    • Memory organization trends:

    Decade   Memory Organization         Example
    1970s    Shared Memory               Cray-1
    1980s    Distributed Memory          Thinking Machines CM-2
    1990s    Distributed Shared Memory   SGI Origin2000
    2000s    Distributed Memory          Workstation Clusters

    • The memory organization of today's commonly used parallel computers:

    Computer               Memory Organization
    SGI Origin2000         DSM
    HP V-Class             DSM
    Cray T3E               Distributed
    IBM SP                 Distributed
    Workstation Clusters   Distributed


    Flow of Control
    • When you look at the flow of control you will see three types of parallel computers:
      • Single Instruction Multiple Data (SIMD)
      • Multiple Instruction Multiple Data (MIMD)
      • Single Program Multiple Data (SPMD)


    Flynn's Taxonomy
    • Flynn's Taxonomy, devised in 1972 by Michael Flynn of Stanford University, describes computers by how streams of instructions interact with streams of data.
    • There can be single or multiple instruction streams, and there can be single or multiple data streams. This gives rise to 4 types of computers.
    • Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD and MIMD.
    • Of these 4, only SIMD and MIMD are applicable to parallel computers.
    • Another computer type, SPMD, is a special case of MIMD.


    SIMD Computers
    • SIMD stands for Single Instruction Multiple Data.
    • Each processor follows the same set of instructions, with different data elements being allocated to each processor.
    • SIMD computers have distributed memory with typically thousands of simple processors, and the processors run in lock step.
    • SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications, such as neural networks.
    • Some examples of SIMD computers were the Thinking Machines CM-2 computer and the computers from the MassPar company.
    • The processors are commanded by the global controller that sends instructions to the processors.
      • It says "add", and they all add.
      • It says "shift to the right", and they all shift to the right.
      • The processors are like obedient soldiers, marching in unison.


    MIMD Computers
    • MIMD stands for Multiple Instruction Multiple Data.
    • There are multiple instruction streams, with separate code segments distributed among the processors.
    • MIMD is actually a superset of SIMD, so the processors can run the same instruction stream or different instruction streams.
    • In addition, there are multiple data streams; different data elements are allocated to each processor.
    • MIMD computers can have either distributed memory or shared memory.
    • While the processors on SIMD computers run in lock step, the processors on MIMD computers run independently of each other.
    • MIMD computers can be used for either data parallel or task parallel applications.
    • Some examples of MIMD computers are the SGI Origin2000 computer and the HP V-Class computer.


    SPMD Computers
    • SPMD stands for Single Program Multiple Data.
    • SPMD is a special case of MIMD.
    • SPMD execution happens when a MIMD computer is programmed to have the same set of instructions per processor.
    • With SPMD computers, while the processors are running the same code segment, each processor can run that code segment asynchronously.
    • Unlike SIMD, the synchronous execution of instructions is relaxed.
    • An example is the execution of an if statement on an SPMD computer.
      • Because each processor computes with its own partition of the data elements, it may evaluate the right hand side of the if statement differently from another processor.
      • One processor may take a certain branch of the if statement, and another processor may take a different branch of the same if statement.
      • Hence, even though each processor has the same set of instructions, those instructions may be evaluated in a different order from one processor to the next.
    • The analogies we used for describing SIMD computers can be modified for MIMD computers.
      • Instead of the SIMD obedient soldiers, all marching in unison, in the MIMD world the processors march to the beat of their own drummer.


    Summary of SIMD versus MIMD

                      SIMD                      MIMD
    Memory            distributed memory        distributed or shared memory
    Code Segment      same per processor        same or different
    Processors Run    in lock step              asynchronously
    Data Elements     different per processor   different per processor
    Applications      data parallel             data parallel or task parallel


    Trends and Examples
    • Flow of control trends:

    Decade   Flow of Control   Computer Example
    1980s    SIMD              Thinking Machines CM-2
    1990s    MIMD              SGI Origin2000
    2000s    MIMD              Workstation Clusters

    • The flow of control on today's commonly used parallel computers:

    Computer               Flow of Control
    SGI Origin2000         MIMD
    HP V-Class             MIMD
    Cray T3E               MIMD
    IBM SP                 MIMD
    Workstation Clusters   MIMD


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
      1.2 Comparison of Parallel Computers
        1.2.1 Processors
        1.2.2 Memory Organization
        1.2.3 Flow of Control
        1.2.4 Interconnection Networks
          1.2.4.1 Bus Network
          1.2.4.2 Cross-Bar Switch Network
          1.2.4.3 Hypercube Network
          1.2.4.4 Tree Network
          1.2.4.5 Interconnection Networks Self-test
        1.2.5 Summary of Parallel Computer Characteristics
      1.3 Summary


    Interconnection Networks
    • What exactly is the interconnection network?
      • The interconnection network is made up of the wires and cables that define how the multiple processors of a parallel computer are connected to each other and to the memory units.
      • The time required to transfer data is dependent upon the specific type of the interconnection network.
      • This transfer time is called the communication time.
    • What network characteristics are important?
      • Diameter: the maximum distance that data must travel for 2 processors to communicate.
      • Bandwidth: the amount of data that can be sent through a network connection.
      • Latency: the delay on a network while a data packet is being stored and forwarded.
    • Types of Interconnection Networks
      • The network topologies (geometric arrangements of the computer network connections) are:
        • Bus
        • Cross-bar Switch
        • Hypercube
        • Tree


    Interconnection Networks
    • The aspects of network issues are:
      • Cost
      • Scalability
      • Reliability
      • Suitable Applications
      • Data Rate
      • Diameter
      • Degree
    • General network characteristics
      • Some networks can be compared in terms of their degree and diameter.
      • Degree: how many communicating wires are coming out of each processor.
        • A large degree is a benefit because it provides multiple paths.
      • Diameter: the distance between the two processors that are farthest apart.
        • A small diameter corresponds to low latency.


    Bus Network
    • Bus topology is the original coaxial cable-based Local Area Network (LAN) topology in which the medium forms a single bus to which all stations are attached.
    • The positive aspects
      • It is a mature technology that is well known and reliable.
      • The cost is very low.
      • It is simple to construct.
    • The negative aspects
      • limited data transmission rate
      • not scalable in terms of performance
    • Example: SGI Power Challenge
      • Only scaled to 18 processors.


    Cross-Bar Switch Network
    • A cross-bar switch is a network that works through a switching mechanism to access shared memory.
    • It scales better than the bus network, but it costs significantly more.
    • The telephone system uses this type of network. An example of a computer with this type of network is the HP V-Class.
    • A diagram of a cross-bar switch network shows the processors talking through the switchboxes to store or retrieve data in memory.
    • There are multiple paths for a processor to communicate with a certain memory.
    • The switches determine the optimal route to take.


    Hypercube Network
    • In a hypercube network, the processors are connected as if they were corners of a multidimensional cube. Each node in an N-dimensional cube is directly connected to N other nodes.
    • The fact that the number of directly connected, "nearest neighbor" nodes increases with the total size of the network is also highly desirable for a parallel computer.
    • The degree of a hypercube network is log n and the diameter is log n, where n is the number of processors. (A worked example follows.)
    • Examples of computers with this type of network are the CM-2, NCUBE-2, and the Intel iPSC860.


    Tree Network
    • The processors are the bottom nodes of the tree. For a processor to retrieve data, it must go up in the network and then go back down.
    • This is useful for decision making applications that can be mapped as trees.
    • The degree of a tree network is 1. The diameter of the network is 2 log(n+1) - 2, where n is the number of processors.
    • The Thinking Machines CM-5 is an example of a parallel computer with this type of network.
    • Tree networks are very suitable for database applications because they allow multiple searches through the database at a time.


    Interconnection Networks
    • Torus Network: a mesh with wrap-around connections in both the x and y directions.
    • Multistage Network: a network with more than one networking unit.
    • Fully Connected Network: a network where every processor is connected to every other processor.
    • Hypercube Network: processors are connected as if they were corners of a multidimensional cube.
    • Mesh Network: a network where each interior processor is connected to its four nearest neighbors.


    Interconnection Networks
    • Bus Based Network: coaxial cable-based LAN topology in which the medium forms a single bus to which all stations are attached.
    • Cross-bar Switch Network: a network that works through a switching mechanism to access shared memory.
    • Tree Network: the processors are the bottom nodes of the tree.
    • Ring Network: each processor is connected to two others, and the line of connections forms a circle.


    Summary of Parallel Computer Characteristics
    • How many processors does the computer have?
      • 10s?
      • 100s?
      • 1000s?
    • How powerful are the processors?
      • What's the MHz rate?
      • What's the MIPS rate?
    • What's the instruction set architecture?
      • RISC
      • CISC


    Summary of Parallel Computer Characteristics
    • How much memory is available?
      • total memory
      • memory per processor
    • What kind of memory?
      • distributed memory
      • shared memory
      • distributed shared memory
    • What type of flow of control?
      • SIMD
      • MIMD
      • SPMD


    Summary of Parallel Computer Characteristics
    • What is the interconnection network?
      • Bus
      • Crossbar
      • Hypercube
      • Tree
      • Torus
      • Multistage
      • Fully Connected
      • Mesh
      • Ring
      • Hybrid


    Design decisions made by some of the major parallel computer vendors

    Computer               Programming Style   OS       Processors          Memory        Flow of Control   Network
    SGI Origin2000         OpenMP, MPI         IRIX     MIPS RISC R10000    DSM           MIMD              Crossbar, Hypercube
    HP V-Class             OpenMP, MPI         HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring
    Cray T3E               SHMEM               Unicos   Compaq Alpha        Distributed   MIMD              Torus
    IBM SP                 MPI                 AIX      IBM Power3          Distributed   MIMD              IBM Switch
    Workstation Clusters   MPI                 Linux    Intel Pentium III   Distributed   MIMD              Myrinet, Tree


    Summary
    • This completes our introduction to parallel computing.
    • You have learned about parallelism in computer programs, and also about parallelism in the hardware components of parallel computers.
    • In addition, you have learned about the commonly used parallel computers, and how these computers compare to each other.
    • There are many good texts which provide an introductory treatment of parallel computing. Here are two useful references:
      • Highly Parallel Computing, Second Edition. George S. Almasi and Allan Gottlieb. Benjamin/Cummings Publishers, 1994.
      • Parallel Computing: Theory and Practice. Michael J. Quinn. McGraw-Hill, Inc., 1994.


    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
      2.1 Automatic Compiler Parallelism
      2.2 Data Parallelism by Hand
      2.3 Mixing Automatic and Hand Parallelism
      2.4 Task Parallelism
      2.5 Parallelism Issues
    3 Porting Issues
    4 Scalar Tuning
    5 Parallel Code Tuning
    6 Timing and Profiling
    7 Cache Tuning
    8 Parallel Performance Analysis
    9 About the IBM Regatta P690


    How to Parallelize a Code
    • This chapter describes how to turn a single processor program into a parallel one, focusing on shared memory machines.
    • Both automatic compiler parallelization and parallelization by hand are covered.
    • The details for accomplishing both data parallelism and task parallelism are presented.


    Automatic Compiler Parallelism
    • Automatic compiler parallelism enables you to use a single compiler option and let the compiler do the work.
    • The advantage is that it's easy to use.
    • The disadvantages are:
      • The compiler only does loop level parallelism, not task parallelism.
      • The compiler wants to parallelize every do loop in your code. If you have hundreds of do loops, this creates way too much parallel overhead.


    Automatic Compiler Parallelism
    • To use automatic compiler parallelism on a Linux system with the Intel compilers, specify the following:

      ifort -parallel -O2 ... prog.f

    • The compiler creates conditional code that will run with any number of threads.
    • Specify the number of threads with setenv, and make sure you still get the right answers:

      setenv OMP_NUM_THREADS 4
      a.out > results


    Data Parallelism by Hand
    • First identify the loops that use most of the CPU time (the Profiling lecture describes how to do this).
    • By hand, insert into the code OpenMP directive(s) just before the loop(s) you want to make parallel.
    • Some code modifications may be needed to remove data dependencies and other inhibitors of parallelism.
    • Use your knowledge of the code and data to assist the compiler.
    • For the SGI Origin2000 computer, insert into the code an OpenMP directive just before the loop that you want to make parallel:

      !$OMP PARALLEL DO
      do i = 1, n
        lots of computation ...
      end do
      !$OMP END PARALLEL DO


    Data Parallelism by Hand
    • Compile with the -mp compiler option:

      f90 -mp ... prog.f

    • As before, the compiler generates conditional code that will run with any number of threads.
    • If you want to rerun your program with a different number of threads, you do not need to recompile; just re-specify the setenv command:

      setenv OMP_NUM_THREADS 8
      a.out > results2

    • The setenv command can be placed anywhere before the a.out command.
    • The setenv command must be typed exactly as indicated. If you have a typo, you will not receive a warning or error message. To make sure that the setenv command is specified correctly, type:

      setenv

      This produces a listing of your environment variable settings.


    Mixing Automatic and Hand Parallelism
    • You can have one source file parallelized automatically by the compiler, and another source file parallelized by hand. Suppose you split your code into two files named prog1.f and prog2.f:

      f90 -c -apo prog1.f     (automatic parallelization of prog1.f)
      f90 -c -mp  prog2.f     (hand parallelization of prog2.f)
      f90 prog1.o prog2.o     (creates one executable)
      a.out > results         (runs the executable)


    Task Parallelism
    • You can accomplish task parallelism as follows:

      !$OMP PARALLEL
      !$OMP SECTIONS
        lots of computation in part A ...
      !$OMP SECTION
        lots of computation in part B ...
      !$OMP SECTION
        lots of computation in part C ...
      !$OMP END SECTIONS
      !$OMP END PARALLEL

    • Compile with the -mp compiler option:

      f90 -mp prog.f

    • Use the setenv command to specify the number of threads:

      setenv OMP_NUM_THREADS 3
      a.out > results


    Parallelism Issues
    • There are some issues to consider when parallelizing a program:
      • Should data parallelism or task parallelism be used?
      • Should automatic compiler parallelism or parallelism by hand be used?
      • Which loop in a nested loop situation should be the one that becomes parallel? (See the sketch after this list.)
      • How many threads should be used?
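
    A minimal sketch of the nested-loop question (my own illustration, not from the slides): other things being equal, parallelizing the outermost independent loop pays the thread start-up and fork/join cost once and gives each thread a large chunk of work, whereas parallelizing the inner loop would pay that overhead on every outer iteration.

      subroutine add_arrays(n, a, b, c)
        implicit none
        integer, intent(in) :: n
        real, intent(in)    :: a(n, n), b(n, n)
        real, intent(out)   :: c(n, n)
        integer :: i, j
      ! Parallelize the outer J loop; the inner I loop stays sequential
      ! inside each thread.
      !$OMP PARALLEL DO PRIVATE(i)
        do j = 1, n
          do i = 1, n
            c(i, j) = a(i, j) + b(i, j)
          end do
        end do
      !$OMP END PARALLEL DO
      end subroutine add_arrays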


    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
    3 Porting Issues
      3.1 Recompile
      3.2 Word Length
      3.3 Compiler Options for Debugging
      3.4 Standards Violations
      3.5 IEEE Arithmetic Differences
      3.6 Math Library Differences
      3.7 Compute Order Related Differences
      3.8 Optimization Level Too High
      3.9 Diagnostic Listings
      3.10 Further Information


    Recompile
    • Some codes just need to be recompiled to get accurate results.
    • The compilers available on the NCSA computer platforms are shown in the following table:

    Language                   SGI Origin2000 (MIPSpro)   IA-32 Linux (Portland Group / Intel / GNU)   IA-64 Linux (Portland Group / Intel / GNU)
    Fortran 77                 f77                        ifort / g77                                  pgf77 / ifort / g77
    Fortran 90                 f90                        ifort                                        pgf90 / ifort
    Fortran 95                 f95                        ifort                                        ifort
    High Performance Fortran                              pghpf                                        pghpf
    C                          cc                         icc / gcc                                    pgcc / icc / gcc
    C++                        CC                         icpc / g++                                   pgCC / icpc / g++


    Word Length
    • Code flaws can occur when you are porting your code to a different word length computer.
      • For C, the size of an integer variable differs depending on the machine and how the variable is generated. On the IA-32 and IA-64 Linux clusters, the size of an integer variable is 4 and 8 bytes, respectively. On the SGI Origin2000, the corresponding value is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag.
      • For Fortran, the SGI MIPSpro and Intel compilers contain the following flags to set the default variable size:
        • -in, where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
        • -rn, where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.


    Compiler Options for Debugging
    • On the SGI Origin2000, the MIPSpro compilers include debugging options via the -DEBUG: group. The syntax is as follows:

      -DEBUG:option1[=value1]:option2[=value2]...

    • Two examples are:
      • Array-bound checking: check for subscripts out of range at runtime.

        -DEBUG:subscript_check=ON

      • Force all un-initialized stack, automatic, and dynamically allocated variables to be initialized.

        -DEBUG:trap_uninitialized=ON


    Compiler Options for Debugging
    • On the IA-32 Linux cluster, the Fortran compiler is equipped with the following -C flags for runtime diagnostics:
      • -CA: pointers and allocatable references
      • -CB: array and subscript bounds
      • -CS: consistent shape of intrinsic procedure
      • -CU: use of uninitialized variables
      • -CV: correspondence between dummy and actual arguments


    Standards Violations
    • Code flaws can occur when the program has non-ANSI standard Fortran coding.
      • ANSI standard Fortran is a set of rules for compiler writers that specify, for example, the value of the do loop index upon exit from the do loop. (A small example follows.)
    • Standards violations detection
      • To detect standards violations on the SGI Origin2000 computer, use the -ansi flag.
      • This option generates a listing of warning messages for the use of non-ANSI standard coding.
      • On the Linux clusters, the -ansi[-] flag enables/disables assumption of ANSI conformance.
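
    A small illustration (mine, not the slide's) of the do-loop-index point: standard Fortran defines the index after normal completion as the first value that fails the loop test, so legacy code written against older, non-standard behavior can silently give different results on a new platform.

      subroutine last_index(n)
        implicit none
        integer, intent(in) :: n
        integer :: i
        do i = 1, n
          ! ... work ...
        end do
        ! Non-portable assumption: some legacy code expects i == n here.
        ! A standard-conforming compiler leaves i == n + 1 after normal exit.
        print *, 'loop index after exit =', i
      end subroutine last_index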


    IEEE Arithmetic Differences
    • Code flaws occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not.
      • The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior.
      • For example, it prohibits the compiler writer from replacing x/y with x*recip(y), since the two results may differ slightly for some operands. You can make your program strictly conform to the IEEE standard.
    • To make your program conform to the IEEE Arithmetic Standard on the SGI Origin2000 computer, use:

      f90 -OPT:IEEE_arithmetic=n ... prog.f

      where n is 1, 2, or 3.
      • This option specifies the level of conformance to the IEEE standard, where 1 is the most stringent and 3 is the most liberal.
    • On the Linux clusters, the Intel compilers can achieve conformance to the IEEE standard at a stringent level with the -mp flag, or at a slightly relaxed level with the -mp1 flag.


    Math Library Differences
    • Most high-performance parallel computers are equipped with vendor-supplied math libraries.
    • On the SGI Origin2000 platform, there are the SGI/Cray Scientific Library (SCSL) and Complib.sgimath.
      • SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK, and Fast Fourier Transform (FFT) routines.
      • SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version.
      • The complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version.
    • The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.


    Math Library Differences
    • On the IA-32 Linux cluster, the libraries to link to are:
      • For BLAS:    -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
      • For LAPACK:  -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
      • When calling MKL routines from C/C++ programs, you also need to link with -lF90.
    • On the IA-64 Linux cluster, the corresponding libraries are:
      • For BLAS:    -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
      • For LAPACK:  -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
      • When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins.


    Compute Order Related Differences
    • Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The compute order in which the threads will run cannot be guaranteed.
      • For example, in a data parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program.
      • Note: If your algorithm depends on data being compared in a specific order, your code is inappropriate for a parallel computer.
    • Use the following method to detect compute order related differences:
      • If your loop looks like
        DO I = 1, N
        change it to
        DO I = N, 1, -1
      • The results should not change if the iterations are independent. (A short example follows.)
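
    A small sketch (my own, not from the slides) of why order can matter: floating-point addition is not associative, so a sum accumulated in a different order can differ in its low-order bits, which is exactly the kind of difference the reversed-loop test exposes.

      program order_test
        implicit none
        integer, parameter :: n = 1000000
        real :: x(n), fwd, bwd
        integer :: i
        do i = 1, n
          x(i) = 1.0 / real(i)     ! values of widely varying magnitude
        end do
        fwd = 0.0
        do i = 1, n                ! forward order
          fwd = fwd + x(i)
        end do
        bwd = 0.0
        do i = n, 1, -1            ! reverse order
          bwd = bwd + x(i)
        end do
        ! The two sums typically differ slightly.
        print *, 'forward =', fwd, ' reverse =', bwd
      end program order_test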


    Optimization Level Too High
    • Code flaws can occur when the optimization level has been set too high, thus trading speed for accuracy.
      • The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at higher optimization levels.
    • Setting the optimization level
      • Both the SGI Origin2000 computer and the IBM Linux clusters provide Level 0 (no optimization) to Level 3 (most aggressive) optimization, using the -O{0,1,2,3} flag. One should bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking the correctness and precision of the calculation is highly recommended when -O3 is used.
      • For example, on the Origin2000,
        f90 -O0 prog.f
        turns off all optimizations.


    Diagnostic Listings
    • The SGI Origin2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:

      f90 -listing ...
      f90 -fullwarn ...
      f90 -showdefaults ...
      f90 -version ...
      f90 -help ...


    Further Information
    • SGI
      • man f77/f90/cc
      • man debug_group
      • man math
      • man complib.sgimath
      • MIPSpro 64-Bit Porting and Transition Guide
      • Online Manuals
    • Linux clusters pages
      • ifort/icc/icpc help (IA32, IA64, Intel64)
      • Intel Fortran Compiler for Linux
      • Intel C/C++ Compiler for Linux


    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
    3 Porting Issues
    4 Scalar Tuning
      4.1 Aggressive Compiler Options
      4.2 Compiler Optimizations
      4.3 Vendor Tuned Code
      4.4 Further Information


    Scalar Tuning
    • If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime.
    • This chapter describes many of these techniques:
      • the use of the most aggressive compiler options
      • the improvement of loop unrolling
      • the use of subroutine inlining
      • the use of vendor supplied tuned code
    • The detection of cache problems, and their solution, are presented in the Cache Tuning chapter.


    Aggressive Compiler Options
    • It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes.
    • It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization.
    • On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n = 1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to the IEEE standard at different levels.
    • On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.


    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
    3 Porting Issues
    4 Scalar Tuning
      4.1 Aggressive Compiler Options
      4.2 Compiler Optimizations
        4.2.1 Statement Level
        4.2.2 Block Level
        4.2.3 Routine Level
        4.2.4 Software Pipelining
        4.2.5 Loop Unrolling
        4.2.6 Subroutine Inlining
        4.2.7 Optimization Report
        4.2.8 Profile-guided Optimization (PGO)
      4.3 Vendor Tuned Code
      4.4 Further Information


    Compiler Optimizations
    • The various compiler optimizations can be classified as follows:
      • Statement Level Optimizations
      • Block Level Optimizations
      • Routine Level Optimizations
      • Software Pipelining
      • Loop Unrolling
      • Subroutine Inlining
    • Each of these is described in the following sections.


    Statement Level
    • Constant folding
      • Replace simple arithmetic operations on constants with the pre-computed result.
      • y = 5 + 7 becomes y = 12
    • Short circuiting
      • Avoid executing parts of conditional tests that are not necessary.
      • In if (I.eq.J .or. I.eq.K) expression, when I = J the expression is computed immediately.
    • Register assignment
      • Put frequently used variables in registers.


    Block Level
    • Dead code elimination
      • Remove unreachable code and code that is never executed or used.
    • Instruction scheduling
      • Reorder the instructions to improve memory pipelining.


    Routine Level
    • Strength reduction
      • Replace expressions in a loop with an expression that takes fewer cycles.
    • Common subexpression elimination
      • Expressions that appear more than once are computed once, and the result is substituted for each occurrence of the expression.
    • Constant propagation
      • Compile time replacement of variables with constants.
    • Loop invariant elimination
      • Expressions inside a loop that don't change with the do loop index are moved outside the loop. (A small sketch follows this list.)
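
    A minimal sketch (my own illustration, not from the slides) of two of these transformations written out by hand; an optimizing compiler performs them automatically.

      subroutine tune_examples(n, x, scale, a, b)
        implicit none
        integer, intent(in) :: n
        real, intent(in)    :: x, scale
        real, intent(inout) :: a(n), b(4*n)
        real    :: r
        integer :: i, k

        ! Loop invariant elimination: x/scale does not depend on i,
        ! so it is computed once outside the loop instead of every pass.
        r = x / scale
        do i = 1, n
          a(i) = a(i) * r          ! was: a(i) = a(i) * (x / scale)
        end do

        ! Strength reduction: the multiply k = 4*i is replaced by an
        ! addition that is updated each iteration.
        k = 0
        do i = 1, n
          k = k + 4                ! was: k = 4 * i
          b(k) = 0.0
        end do
      end subroutine tune_examples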


    Software Pipelining
    • Software pipelining allows the mixing of operations from different loop iterations in each iteration of the hardware loop. It is used to get the maximum work done per clock cycle.
    • Note: On the R10000s there is out-of-order execution of instructions, and software pipelining may actually get in the way of this feature.


    Loop Unrolling
    • The loop's stride (or step) value is increased, and the body of the loop is replicated. It is used to improve the scheduling of the loop by giving a longer sequence of straight line code. An example of loop unrolling follows:

      Original Loop              Unrolled Loop
      do I = 1, 99               do I = 1, 99, 3
        c(I) = a(I) + b(I)         c(I)   = a(I)   + b(I)
      enddo                        c(I+1) = a(I+1) + b(I+1)
                                   c(I+2) = a(I+2) + b(I+2)
                                 enddo

    • There is a limit to the amount of unrolling that can take place because there are a limited number of registers.
    • On the SGI Origin2000, loops are unrolled to a level of 8 by default. You can unroll to a level of 12 by specifying:

      f90 -O3 -OPT:unroll_times_max=12 ... prog.f

    • On the IA-32 Linux cluster, the corresponding flags are -unroll and -unroll0, for unrolling and no unrolling, respectively.


Subroutine Inlining
Subroutine inlining replaces a call to a subroutine with the body of the subroutine itself.
One reason for using subroutine inlining is that when a subroutine is called inside a do loop that has a huge iteration count, inlining may be more efficient because it cuts down on loop overhead.
However, the chief reason for using it is that do loops that contain subroutine calls may not parallelize. A small sketch follows.
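A minimal, hypothetical illustration (the routine axpy1 and the arrays are invented): the call inside the first loop can inhibit parallelization and adds call overhead on every iteration, while the inlined version exposes a simple, analyzable loop body.

      ! original: call inside the loop (hypothetical)
      do i = 1, n
         call axpy1(y(i), a, x(i))
      enddo
      ...
      subroutine axpy1(yi, a, xi)
      real yi, a, xi
      yi = yi + a*xi
      end

      ! after inlining: the loop body is a single statement
      do i = 1, n
         y(i) = y(i) + a*x(i)
      enddo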


Subroutine Inlining
On the SGI Origin2000 computer, there are several options to invoke inlining:
  - Inline all routines except those specified to -INLINE:never
        f90 -O3 -INLINE:all prog.f
  - Inline no routines except those specified to -INLINE:must
        f90 -O3 -INLINE:none prog.f
  - Specify a list of routines to inline at every call
        f90 -O3 -INLINE:must=subrname prog.f
  - Specify a list of routines never to inline
        f90 -O3 -INLINE:never=subrname prog.f
On the Linux clusters, the following flags can invoke function inlining:
  - -ip  : inline function expansion for calls defined within the current source file
  - -ipo : inline function expansion for calls defined in separate files
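The slide gives only the flags for the Linux clusters; a typical invocation might look like the following (illustrative only, assuming the Intel Fortran compiler and source file names of your own):

    ifort -O3 -ip  -c prog.f         # inlining within prog.f only
    ifort -O3 -ipo -o prog a.f b.f   # inlining across a.f and b.f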


Optimization Report
Intel 9.x and later compilers can generate reports that provide useful information on optimization done on different parts of your code.
To generate such optimization reports in a file filename, add the flag -opt-report-file filename.
If you have a lot of source files to process simultaneously, and you use a makefile to compile, you can also use make's "suffix" rules to have optimization reports produced automatically, each with a unique name. For example,

    .f.o:
            ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f

creates optimization reports that are named identically to the original Fortran source but with the suffix ".f" replaced by ".opt".


Optimization Report
To help developers and performance analysts navigate through the usually lengthy optimization reports, the NCSA program OptView is designed to provide an easy-to-use and intuitive interface that allows the user to browse through their own source code, cross-referenced with the optimization reports.
OptView is installed on NCSA's IA64 Linux cluster under the directory /usr/apps/tools/bin. You can either add that directory to your UNIX PATH or you can invoke OptView using an absolute path name. You'll need to be using the X-Window system and to have set your DISPLAY environment variable correctly for OptView to work.
OptView can provide a quick overview of which loops in a source code, or among multiple source files, are highly optimized and which might need further work. For a detailed description of the use of OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/


Profile-guided Optimization (PGO)
Profile-guided optimization allows Intel compilers to use valuable runtime information to make better decisions about function inlining and interprocedural optimizations to generate faster code. Its methodology, shown as a diagram in the original slides (not reproduced here), follows the typical three-step workflow sketched below.
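A minimal sketch of the usual PGO workflow with the Intel compilers of that era, using the -prof-gen and -prof-use flags; the source and input file names are illustrative only:

    # Step 1: compile with profiling instrumentation
    ifort -prof-gen -o prog prog.f

    # Step 2: run on representative input; this writes profile data files
    ./prog < typical_input.dat

    # Step 3: recompile, letting the compiler use the collected profile
    ifort -prof-use -O3 -o prog prog.f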


Vendor Tuned Code
Vendor math libraries have codes that are optimized for their specific machine.
On the SGI Origin2000 platform, Complib.sgimath and SCSL are available.
On the Linux clusters, Intel MKL is available. Ways to link to these libraries are described in Section 3 - Porting Issues.


Further Information
SGI IRIX man and www pages
  - man opt
  - man lno
  - man inline
  - man ipa
  - man perfex
  - Performance Tuning for the Origin2000 at http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
Linux clusters help and www pages
  - ifort/icc/icpc help (Intel)
  - http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
  - http://perfsuite.ncsa.uiuc.edu/OptView/


Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
  5.1 Sequential Code Limitation
  5.2 Parallel Overhead
  5.3 Load Balance
    5.3.1 Loop Schedule Types
    5.3.2 Chunk Size


Parallel Code Tuning
This chapter describes several of the most common techniques for parallel tuning, the types of programs that benefit, and the details for implementing them.
The majority of this chapter deals with improving load balancing.


Sequential Code Limitation
Sequential code is a part of the program that cannot be run with multiple processors. Some reasons why it cannot be made data parallel are:
  - The code is not in a do loop.
  - The do loop contains a read or write.
  - The do loop contains a dependency.
  - The do loop has an ambiguous subscript.
  - The do loop has a call to a subroutine or a reference to a function subprogram.
Sequential Code Fraction
As shown by Amdahl's Law, if the sequential fraction is too large, there is a limitation on speedup. If you think too much sequential code is a problem, you can calculate the sequential fraction of code using the Amdahl's Law formula.


Sequential Code Limitation
Measuring the Sequential Code Fraction
  - Decide how many processors to use; this is p.
  - Run and time the program with 1 processor to give T(1).
  - Run and time the program with p processors to give T(p).
  - Form the ratio of the two timings, SP = T(1)/T(p).
  - Substitute SP and p into the Amdahl's Law formula:
        f = (1/SP - 1/p) / (1 - 1/p)
    where f is the fraction of sequential code.
  - Solve for f; this is the fraction of sequential code. (A small worked example follows below.)
Decreasing the Sequential Code Fraction
The compilation optimization reports list which loops could not be parallelized and why. You can use this report as a guide to improve performance on do loops by:
  - Removing dependencies
  - Removing I/O
  - Removing calls to subroutines and function subprograms
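Worked example (the timings are invented for illustration): suppose p = 4, T(1) = 120 seconds, and T(4) = 40 seconds, so SP = 120/40 = 3. Then

    f = (1/SP - 1/p) / (1 - 1/p)
      = (1/3 - 1/4) / (1 - 1/4)
      = 0.0833 / 0.75
      = 0.111

so roughly 11% of the work is effectively sequential, which by Amdahl's Law caps the achievable speedup at about 1/f, or roughly 9, no matter how many processors are used.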


Parallel Overhead
Parallel overhead is the processing time spent
  - creating threads
  - spin/blocking threads
  - starting and ending parallel regions
  - synchronizing at the end of parallel regions
When the computational work done by the parallel processes is too small, the overhead time needed to create and control the parallel processes can be disproportionately large, limiting the savings due to parallelism.
Measuring Parallel Overhead
To get a rough under-estimate of parallel overhead:
  - Run and time the code using 1 processor.
  - Parallelize the code.
  - Run and time the parallel code using only 1 processor.
  - Subtract the 2 timings.


Parallel Overhead
Reducing Parallel Overhead
To reduce parallel overhead:
  - Don't parallelize all the loops.
  - Don't parallelize small loops. To benefit from parallelization, a loop needs about 1000 floating point operations or 500 statements in the loop. You can use the IF modifier in the OpenMP directive to control when loops are parallelized:

        !$OMP PARALLEL DO IF(n > 500)
        do i = 1, n
           ... body of loop ...
        end do
        !$OMP END PARALLEL DO

  - Use task parallelism instead of data parallelism. It doesn't generate as much parallel overhead and often more code runs in parallel. (A short sketch follows this list.)
  - Don't use more threads than you need.
  - Parallelize at the highest level possible.
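The slides don't show a task parallelism example; as a hedged illustration, OpenMP sections let independent pieces of work run concurrently instead of splitting a single loop across threads. The two subroutines here are hypothetical and assumed to be independent of each other:

      !$OMP PARALLEL SECTIONS
      !$OMP SECTION
            call compute_forces(x, f, n)    ! hypothetical routine
      !$OMP SECTION
            call update_boundaries(b, m)    ! hypothetical routine
      !$OMP END PARALLEL SECTIONS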


Load Balance
Load balance is the even assignment of subtasks to processors so as to keep each processor busy doing useful work for as long as possible.
Load balance is important for speedup because the end of a do loop is a synchronization point where threads need to catch up with each other.
If processors have different work loads, some of the processors will idle while others are still working.
Measuring Load Balance
On the SGI Origin, to measure load balance, use the perfex tool, which is a command line interface to the R10000 hardware counters. The command

    perfex -e 16 -mp a.out > results

reports per-thread cycle counts. Compare the cycle counts to determine load balance problems. The master thread (thread 0) always uses more cycles than the slave threads. If the counts are vastly different, it indicates load imbalance.


Load Balance
For Linux systems, the thread CPU times can be compared with ps. A thread with unusually high or low time compared to the others may not be working efficiently (high CPU time could be the result of a thread spinning while waiting for other threads to catch up).

    ps uH

Improving Load Balance
To improve load balance, try changing the way that loop iterations are allocated to threads by
  - changing the loop schedule type
  - changing the chunk size
These methods are discussed in the following sections.


Loop Schedule Types
On the SGI Origin2000 computer, 4 different loop schedule types can be specified by an OpenMP directive. They are:
  - Static
  - Dynamic
  - Guided
  - Runtime
If you don't specify a schedule type, the default will be used.
Default Schedule Type
The default schedule type allocates 20 iterations on 4 threads as illustrated in the original slide's diagram (not reproduced here).


Loop Schedule Types
Static Schedule Type
The static schedule type is used when some of the iterations do more work than others. With the static schedule type, iterations are allocated in a round-robin fashion to the threads.
An Example
Suppose you are computing on the upper triangle of a 100 x 100 matrix, and you use 2 threads, named t0 and t1. With default scheduling, workloads are uneven.


Loop Schedule Types
Whereas with static scheduling, the columns of the matrix are given to the threads in a round-robin fashion, resulting in better load balance.


Loop Schedule Types
Dynamic Schedule Type
The iterations are dynamically allocated to threads at runtime. Each thread is given a chunk of iterations. When a thread finishes its work, it goes into a critical section where it's given another chunk of iterations to work on.
This type is useful when you don't know the iteration count or work pattern ahead of time. Dynamic gives good load balance, but at a high overhead cost.
Guided Schedule Type
The guided schedule type is dynamic scheduling that starts with large chunks of iterations and ends with small chunks of iterations. That is, the number of iterations given to each thread depends on the number of iterations remaining. The guided schedule type reduces the number of entries into the critical section, compared to the dynamic schedule type. Guided gives good load balancing at a low overhead cost.


Chunk Size
The word chunk refers to a grouping of iterations. Chunk size means how many iterations are in the grouping. The static and dynamic schedule types can be used with a chunk size. If a chunk size is not specified, then the chunk size is 1.
Suppose you specify a chunk size of 2 with the static schedule type. Then 20 iterations are allocated on 4 threads two at a time in round-robin order: iterations 1-2 go to thread 0, 3-4 to thread 1, 5-6 to thread 2, 7-8 to thread 3, then 9-10 back to thread 0, and so on.
The schedule type and chunk size are specified as follows:

    !$OMP PARALLEL DO SCHEDULE(type, chunk)
    !$OMP END PARALLEL DO

where type is STATIC, DYNAMIC, or GUIDED and chunk is any positive integer. A complete sketch follows.
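A minimal sketch of the directive in context (the loop body is hypothetical, and the arrays a, b, c are assumed to be declared elsewhere):

      !$OMP PARALLEL DO SCHEDULE(STATIC, 2)
      do i = 1, 20
         c(i) = a(i) + b(i)
      end do
      !$OMP END PARALLEL DO

With STATIC and a chunk size of 2, the 20 iterations are dealt out to the threads two at a time in round-robin order, as described above.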


Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information


Timing and Profiling
Now that your program has been ported to the new computer, you will want to know how fast it runs.
This chapter describes how to measure the speed of a program using various timing routines.
The chapter also covers how to determine which parts of the program account for the bulk of the computational load so that you can concentrate your tuning efforts on those computationally intensive parts of the program.


Timing
In the following sections, we'll discuss timers and review the profiling tools ssrun and prof on the Origin and vprof and gprof on the Linux Clusters. The specific timing functions described are:
  - Timing a section of code
      FORTRAN: etime, dtime, cpu_time for CPU time; time and f_time for wallclock time
      C: clock for CPU time; gettimeofday for wallclock time
  - Timing an executable
      time a.out
  - Timing a batch run
      busage, qstat, qhist


CPU Time
etime
A section of code can be timed using etime. It returns the elapsed CPU time in seconds since the program started.

    real*4 tarray(2), time1, time2, timeres
    ! beginning of program
    time1 = etime(tarray)
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    time2 = etime(tarray)
    timeres = time2 - time1


CPU Time
dtime
A section of code can also be timed using dtime. It returns the elapsed CPU time in seconds since the last call to dtime.

    real*4 tarray(2), timeres
    ! beginning of program
    timeres = dtime(tarray)
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    timeres = dtime(tarray)
    ! rest of program


CPU Time
The etime and dtime Functions
  - User time. This is returned as the first element of tarray. It's the CPU time spent executing user code.
  - System time. This is returned as the second element of tarray. It's the time spent executing system calls on behalf of your program.
  - Sum of user and system time. This is the function value that is returned. It's the time that is usually reported.
  - Metric. Timings are reported in seconds and are accurate to 1/100th of a second.


CPU Time
cpu_time
The cpu_time routine is available only on the Linux clusters as it is a component of the Intel FORTRAN compiler library. It provides substantially higher resolution and has substantially lower overhead than the older etime and dtime routines. It can be used as an elapsed timer.

    real*8 time1, time2, timeres
    ! beginning of program
    call cpu_time(time1)
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    call cpu_time(time2)
    timeres = time2 - time1
    ! rest of program


CPU Time
clock
For C programmers, one can call the cpu_time routine using a FORTRAN wrapper, or call the intrinsic function clock, which can be used to determine elapsed CPU time. (The header name below was stripped in extraction; clock and CLOCKS_PER_SEC come from the standard header time.h.)

    #include <time.h>

    static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
    double time1, time2, timeres;

    time1 = (clock()*iCPS);
    /* do some work */
    time2 = (clock()*iCPS);
    timeres = time2 - time1;


Wall clock Time
time
For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970. It is a means of getting the elapsed wall clock time. The wall clock time is reported in integer seconds.

    external time
    integer*4 time1, time2, timeres
    ! beginning of program
    time1 = time()
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    time2 = time()
    timeres = time2 - time1


Wall clock Time
f_time
For the Linux clusters, the appropriate FORTRAN function for elapsed time is f_time.

    integer*8 f_time
    external f_time
    integer*8 time1, time2, timeres
    ! beginning of program
    time1 = f_time()
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    time2 = f_time()
    timeres = time2 - time1

As above for etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib.


Wall clock Time
gettimeofday
For C programmers, wallclock time can be obtained by using the very portable routine gettimeofday. (The header names below were stripped in extraction; the comments indicate their purpose, and standard headers providing NULL and the timeval struct / gettimeofday prototype are used here.)

    #include <stddef.h>    /* definition of NULL */
    #include <sys/time.h>  /* definition of timeval struct and
                              prototyping of gettimeofday */
    double t1, t2, elapsed;
    struct timeval tp;
    int rtn;
    ....
    rtn = gettimeofday(&tp, NULL);
    t1 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
    ....
    /* do some work */
    ....
    rtn = gettimeofday(&tp, NULL);
    t2 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
    elapsed = t2 - t1;


Timing an Executable
To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time):

    time options a.out

where options can be -p for a simple output, or -f format, which allows the user to display more than just time-related information.
Consult the man pages on the time command for format options.
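For instance, the POSIX -p option prints real (wallclock), user, and system time; the numbers below are illustrative only, not from an actual run in the slides:

    /usr/bin/time -p ./a.out
    real 12.48
    user 11.73
    sys 0.31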


Timing a Batch Job
Time of a batch job, running or completed.
  - Origin:
        busage jobid
  - Linux clusters:
        qstat jobid    # for a running job
        qhist jobid    # for a completed job


Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information


Profiling
Profiling determines where a program spends its time. It detects the computationally intensive parts of the code.
Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load.
Most codes follow the 90-10 Rule: 90% of the computation is done in 10% of the code.


Profiling Tools
Profiling Tools on the Origin
On the SGI Origin2000 computer there are profiling tools named ssrun and prof. Used together they do profiling, or what is called hot spot analysis. They are useful for generating timing profiles.
  - ssrun
    The ssrun utility collects performance data for an executable that you specify. The performance data is written to a file named "executablename.exptype.id".
  - prof
    The prof utility analyzes the data file created by ssrun and produces a report.
  - Example

        ssrun -fpcsamp a.out
        prof -h a.out.fpcsamp.m12345 > prof.list


Profiling Tools
Profiling Tools on the Linux Clusters
On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun, prof, and perfex tools.
  - gprof
    Basic profiling information can be generated using the OS utility gprof.
    First, compile the code with the compiler flags -qp -g for the Intel compiler (-g on the Intel compiler does not change the optimization level) or -pg for the GNU compiler.
    Second, run the program.
    Finally, analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out

        efc -O -qp -g -o foo foo.f
        ./foo
        gprof foo gmon.out


Profiling Tools
Profiling Tools on the Linux Clusters
  - vprof
    On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library.
    To instrument the whole application requires recompiling and linking to the vprof and PAPI libraries.

        setenv VMON PAPI_TOT_CYC
        ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
        ./md
        /usr/apps/tools/vprof/bin/cprof -e md vmon.out


Profile Listings
Profile Listings on the Origin
Prof Output First Listing
The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.

    Cycles    %      Cum %   Secs   Proc
    --------  -----  ------  -----  ------
    42630984  58.47   58.47  0.57   VSUB
     6498294   8.91   67.38  0.09   PFSOR
     6141611   8.42   75.81  0.08   PBSOR
     3654120   5.01   80.82  0.05   PFSOR1
     2615860   3.59   84.41  0.03   VADD
     1580424   2.17   86.57  0.02   ITSRCG
     1144036   1.57   88.14  0.02   ITSRSI
      886044   1.22   89.36  0.01   ITJSI
      861136   1.18   90.54  0.01   ITJCG


Profile Listings
Profile Listings on the Origin
Prof Output Second Listing
The second listing gives the number of cycles per source code line. The lines are listed in descending order of cycle count.

    Cycles    %      Cum %   Line   Proc
    --------  -----  ------  -----  ------
    36556944  50.14   50.14  8106   VSUB
     5313198   7.29   57.43  6974   PFSOR
     4968804   6.81   64.24  6671   PBSOR
     2989882   4.10   68.34  8107   VSUB
     2564544   3.52   71.86  7097   PFSOR1
     1988420   2.73   74.59  8103   VSUB
     1629776   2.24   76.82  8045   VADD
      994210   1.36   78.19  8108   VSUB
      969056   1.33   79.52  8049   VADD
      483018   0.66   80.18  6972   PFSOR


Profile Listings
Profile Listings on the Linux Clusters
gprof Output First Listing
The listing gives a 'flat' profile of functions and routines encountered, sorted by 'self seconds', which is the number of seconds accounted for by this function alone.

    Flat profile:
    Each sample counts as 0.000976562 seconds.
      %    cumulative   self               self       total
     time    seconds   seconds     calls   us/call    us/call   name
    -----  ----------  -------  --------  ---------  ---------  -----------
    38.07       5.67     5.67       101   56157.18  107450.88   compute_
    34.72      10.84     5.17  25199500       0.21       0.21   dist_
    25.48      14.64     3.80                                   SIND_SINCOS
     1.25      14.83     0.19                                   sin
     0.37      14.88     0.06                                   cos
     0.05      14.89     0.01     50500       0.15       0.15   dotr8_
     0.05      14.90     0.01       100      68.36      68.36   update_
     0.01      14.90     0.00                                   f_fioinit
     0.01      14.90     0.00                                   f_intorange
     0.01      14.90     0.00                                   mov
     0.00      14.90     0.00         1       0.00       0.00   initialize_


Profile Listings
Profile Listings on the Linux Clusters
gprof Output Second Listing
The second listing gives a 'call-graph' profile of functions and routines encountered. The definitions of the columns are specific to the line in question. Detailed information is contained in the full output from gprof.

    Call graph:

    index  % time    self  children     called              name
    -----  ------  ------  --------  -----------------    ----------------
    [1]      72.9    0.00     10.86                        main [1]
                     5.67      5.18        101/101             compute_ [2]
                     0.01      0.00        100/100             update_ [8]
                     0.00      0.00          1/1               initialize_ [12]
    -------------------------------------------------------------------------
                     5.67      5.18        101/101          main [1]
    [2]      72.8    5.67      5.18        101              compute_ [2]
                     5.17      0.00  25199500/25199500         dist_ [3]
                     0.01      0.00     50500/50500            dotr8_ [7]
    -------------------------------------------------------------------------
                     5.17      0.00  25199500/25199500      compute_ [2]
    [3]      34.7    5.17      0.00   25199500              dist_ [3]
    -------------------------------------------------------------------------
    [4]      25.5    3.80      0.00                         SIND_SINCOS [4]


Profile Listings
Profile Listings on the Linux Clusters
vprof Listing

    Columns correspond to the following events:
      PAPI_TOT_CYC - Total cycles (1956 events)
    File Summary:
      100.0% /u/ncsa/gbauer/temp/md.f
    Function Summary:
       84.4% compute
       15.6% dist
    Line Summary:
       67.3% /u/ncsa/gbauer/temp/md.f:106
       13.6% /u/ncsa/gbauer/temp/md.f:104
        9.3% /u/ncsa/gbauer/temp/md.f:166
        2.5% /u/ncsa/gbauer/temp/md.f:165
        1.5% /u/ncsa/gbauer/temp/md.f:102
        1.2% /u/ncsa/gbauer/temp/md.f:164
        0.9% /u/ncsa/gbauer/temp/md.f:107
        0.8% /u/ncsa/gbauer/temp/md.f:169
        0.8% /u/ncsa/gbauer/temp/md.f:162
        0.8% /u/ncsa/gbauer/temp/md.f:105

The above listing, generated using the -e option to cprof, displays not only the cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions.


Profiling Analysis
The program being analyzed in the previous Origin example has approximately 10000 source code lines, and consists of many subroutines.
The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine.
The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation.
Going back to the source code, line 8106 is a line inside a do loop. By putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part (a sketch of what that looks like follows).
Since the compiler has rearranged the source lines, the line numbers given by ssrun/prof give you an area of the code to inspect. To view the rearranged source use the options

    f90 -FLIST:=ON
    cc -CLIST:=ON

For the Intel compilers, the appropriate options are ifort -E and icc -E.
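As a hedged sketch (the loop shown is invented; the real loop at line 8106 of VSUB is not reproduced in the slides), parallelizing a hot loop found by the profiler is often as simple as adding one directive pair, provided the iterations are independent:

      !$OMP PARALLEL DO
      do i = 1, n
         c(i) = a(i) - b(i)
      end do
      !$OMP END PARALLEL DO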


Further Information
SGI IRIX
  - man etime
  - man 3 time
  - man 1 time
  - man busage
  - man timers
  - man ssrun
  - man prof
  - Origin2000 Performance Tuning and Optimization Guide
Linux Clusters
  - man 3 clock
  - man 2 gettimeofday
  - man 1 time
  - man 1 gprof
  - man 1B qstat
  - Intel Compilers
  - Vprof on NCSA Linux Cluster


Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690


Agenda
7 Cache Tuning
  7.1 Cache Concepts
    7.1.1 Memory Hierarchy
    7.1.2 Cache Mapping
    7.1.3 Cache Thrashing
    7.1.4 Cache Coherence
  7.2 Cache Specifics
  7.3 Code Optimization
  7.4 Measuring Cache Performance
  7.5 Locating the Cache Problem
  7.6 Cache Tuning Strategy
  7.7 Preserve Spatial Locality
  7.8 Locality Problem
  7.9 Grouping Data Together
  7.10 Cache Thrashing Example
  7.11 Not Enough Cache
  7.12 Loop Blocking
  7.13 Further Information


Cache Concepts
The CPU time required to perform an operation is the sum of the clock cycles executing instructions and the clock cycles waiting for memory.
The CPU cannot be performing useful work if it is waiting for data to arrive from memory.
Clearly then, the memory system is a major factor in determining the performance of your program, and a large part of that is your use of the cache.
The following sections will discuss the key concepts of cache, including:
  - Memory subsystem hierarchy
  - Cache mapping
  - Cache thrashing
  - Cache coherence


Memory Hierarchy
The different subsystems in the memory hierarchy have different speeds, sizes, and costs.
  - Smaller memory is faster
  - Slower memory is cheaper
The