01-Parallel Computing Explained



    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
    3 Porting Issues
    4 Scalar Tuning
    5 Parallel Code Tuning
    6 Timing and Profiling
    7 Cache Tuning
    8 Parallel Performance Analysis
    9 About the IBM Regatta P690


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
        1.1.1 Parallelism in our Daily Lives
        1.1.2 Parallelism in Computer Programs
        1.1.3 Parallelism in Computers
          1.1.3.4 Disk Parallelism
        1.1.4 Performance Measures
        1.1.5 More Parallelism Issues
      1.2 Comparison of Parallel Computers
      1.3 Summary


    Parallel Computing Overview
    Who should read this chapter?
    • New users, to learn concepts and terminology.
    • Intermediate users, for review or reference.
    • Management staff, to understand the basic concepts even if you don't plan to do any programming.
    • Note: Advanced users may opt to skip this chapter.


    Introduction to Parallel Computing
    • High performance parallel computers
      • can solve large problems much faster than a desktop computer
      • have fast CPUs, large memory, high speed interconnects, and high speed input/output
      • are able to speed up computations
        • by making the sequential components run faster
        • by doing more operations in parallel
    • High performance parallel computers are in demand
      • There is a need for tremendous computational capabilities in science, engineering, and business.
      • Applications require gigabytes/terabytes of memory and gigaflops/teraflops of performance.
      • Scientists are striving for petascale performance.


    Introduction to Parallel Computing
    • High performance parallel computers are used in a wide variety of disciplines:
      • Meteorologists: prediction of tornadoes and thunderstorms
      • Computational biologists: analysis of DNA sequences
      • Pharmaceutical companies: design of new drugs
      • Oil companies: seismic exploration
      • Wall Street: analysis of financial markets
      • NASA: aerospace vehicle design
      • Entertainment industry: special effects in movies and commercials
    • These complex scientific and business applications all need to perform computations on large datasets or large equations.


    Parallelism in our Daily Lives
    • There are two types of processes that occur in computers and in our daily lives:
    • Sequential processes
      • occur in a strict order
      • it is not possible to do the next step until the current one is completed
      • Examples
        • The passage of time: the sun rises and the sun sets.
        • Writing a term paper: pick the topic, research, and write the paper.
    • Parallel processes
      • many events happen simultaneously
      • Examples
        • Plant growth in the springtime
        • An orchestra


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
        1.1.1 Parallelism in our Daily Lives
        1.1.2 Parallelism in Computer Programs
          1.1.2.1 Data Parallelism
          1.1.2.2 Task Parallelism
        1.1.3 Parallelism in Computers
          1.1.3.4 Disk Parallelism
        1.1.4 Performance Measures
        1.1.5 More Parallelism Issues
      1.2 Comparison of Parallel Computers
      1.3 Summary


    Parallelism in Computer Programs
    • Conventional wisdom:
      • Computer programs are sequential in nature.
      • Only a small subset of them lend themselves to parallelism.
      • Algorithm: the "sequence of steps" necessary to do a computation.
      • For the first 30 years of computer use, programs were run sequentially.
    • The 1980s saw great successes with parallel computers.
      • Dr. Geoffrey Fox published a book entitled Parallel Computing Works!, describing many scientific accomplishments resulting from parallel computing.
      • Computer programs are parallel in nature.
      • Only a small subset of them need to be run sequentially.


    Parallel Computing
    • Parallel computing is what a computer does when it carries out more than one computation at a time using more than one processor.
    • By using many processors at once, we can speed up the execution.
      • If one processor can perform the arithmetic in time t, then ideally p processors can perform the arithmetic in time t/p.
      • What if I use 100 processors? What if I use 1000 processors? (See the worked example after this list.)
    • Almost every program has some form of parallelism.
      • You need to determine whether your data or your program can be partitioned into independent pieces that can be run simultaneously.
      • Decomposition is the name given to this partitioning process.
    • Types of parallelism:
      • data parallelism
      • task parallelism


    Data Parallelism
    • The same code segment runs concurrently on each processor, but each processor is assigned its own part of the data to work on.
      • Do loops (in Fortran) define the parallelism.
      • The iterations must be independent of each other.
    • Data parallelism is called "fine grain parallelism" because the computational work is spread into many small subtasks.
    • Example
      • Dense linear algebra, such as matrix multiplication, is a perfect candidate for data parallelism.


    An Example of Data Parallelism

    Original Sequential Code:

      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO

    Parallel Code:

      !$OMP PARALLEL DO
      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO
      !$OMP END PARALLEL DO


    Quick Intro to OpenMP
    • OpenMP is a portable standard for parallel directives covering both data and task parallelism.
    • More information about OpenMP is available on the OpenMP website.
    • We will have a lecture on Introduction to OpenMP later.
    • With OpenMP, the loop that is performed in parallel is the loop that immediately follows the PARALLEL DO directive.
      • In our sample code, it's the K loop: DO K=1,N


    OpenMP Loop Parallelism: Iteration-Processor Assignments

    The code segment running on each processor:

      DO J=1,N
        DO I=1,N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
        END DO
      END DO

    Processor   Iterations of K   Data Elements
    proc0       K=1:5             A(I,1:5),   B(1:5,J)
    proc1       K=6:10            A(I,6:10),  B(6:10,J)
    proc2       K=11:15           A(I,11:15), B(11:15,J)
    proc3       K=16:20           A(I,16:20), B(16:20,J)


    OpenMP Style of Parallelism
    • Parallelization can be done incrementally, as follows:
      1. Parallelize the most computationally intensive loop.
      2. Compute the performance of the code.
      3. If performance is not satisfactory, parallelize another loop.
      4. Repeat steps 2 and 3 as many times as needed.
    • The ability to perform incremental parallelism is considered a positive feature of data parallelism.
    • It is contrasted with the MPI (Message Passing Interface) style of parallelism, which is an "all or nothing" approach.


    Task Parallelism
    • Task parallelism may be thought of as the opposite of data parallelism.
      • Instead of the same operations being performed on different parts of the data, each process performs different operations.
    • You can use task parallelism when your program can be split into independent pieces, often subroutines, that can be assigned to different processors and run concurrently.
    • Task parallelism is called "coarse grain" parallelism because the computational work is spread into just a few subtasks.
    • More code is run in parallel because the parallelism is implemented at a higher level than in data parallelism.
    • Task parallelism is often easier to implement and has less overhead than data parallelism.


    Task Parallelism
    • The abstract code shown in the diagram is decomposed into 4 independent code segments, labeled A, B, C, and D. The right hand side of the diagram illustrates the 4 code segments running concurrently.


    Task Parallelism

    Original Code:

      program main
        code segment labeled A
        code segment labeled B
        code segment labeled C
        code segment labeled D
      end

    Parallel Code:

      program main
      !$OMP PARALLEL
      !$OMP SECTIONS
        code segment labeled A
      !$OMP SECTION
        code segment labeled B
      !$OMP SECTION
        code segment labeled C
      !$OMP SECTION
        code segment labeled D
      !$OMP END SECTIONS
      !$OMP END PARALLEL
      end
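
    For readers who want something they can actually compile, here is a minimal, self-contained sketch of the same SECTIONS structure; the subroutine names part_a through part_d are hypothetical stand-ins for the four code segments, not part of the original slides.

      ! Build with an OpenMP flag (compiler dependent, e.g. -mp or -fopenmp).
      program tasks
        implicit none
      !$OMP PARALLEL
      !$OMP SECTIONS
      !$OMP SECTION
        call part_a()      ! code segment labeled A
      !$OMP SECTION
        call part_b()      ! code segment labeled B
      !$OMP SECTION
        call part_c()      ! code segment labeled C
      !$OMP SECTION
        call part_d()      ! code segment labeled D
      !$OMP END SECTIONS
      !$OMP END PARALLEL
      contains
        subroutine part_a()
          print *, 'part A'
        end subroutine part_a
        subroutine part_b()
          print *, 'part B'
        end subroutine part_b
        subroutine part_c()
          print *, 'part C'
        end subroutine part_c
        subroutine part_d()
          print *, 'part D'
        end subroutine part_d
      end program tasks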


    OpenMP Task Parallelism
    • With OpenMP, the code that follows each SECTION(S) directive is allocated to a different processor. In our sample parallel code, the allocation of code segments to processors is as follows:

    Processor   Code
    proc0       code segment labeled A
    proc1       code segment labeled B
    proc2       code segment labeled C
    proc3       code segment labeled D


    Parallelism in Computers
    • How parallelism is exploited and enhanced within the operating system and hardware components of a parallel computer:
      • operating system
      • arithmetic
      • memory
      • disk


    Operating System Parallelism
    • All of the commonly used parallel computers run a version of the Unix operating system. In the table below each OS listed is in fact Unix, but the name of the Unix OS varies with each vendor.
    • For more information about Unix, a collection of Unix documents is available.

    Parallel Computer      OS
    SGI Origin2000         IRIX
    HP V-Class             HP-UX
    Cray T3E               Unicos
    IBM SP                 AIX
    Workstation Clusters   Linux


    Two Unix Parallelism Features
    • Background processing facility
      • With the Unix background processing facility you can run the executable a.out in the background and simultaneously view the man page for the etime function in the foreground. There are two Unix commands that accomplish this:

        a.out > results &
        man etime

    • cron feature
      • With the Unix cron feature you can submit a job that will run at a later time.


    Arithmetic Parallelism
    • Multiple execution units
      • facilitate arithmetic parallelism.
      • The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are each done in a separate execution unit. This allows several execution units to be used simultaneously, because the execution units operate independently.
    • Fused multiply and add
      • is another parallel arithmetic feature.
      • Parallel computers are able to overlap multiply and add. This arithmetic is named Multiply-ADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on HP computers. In either case, the two arithmetic operations are overlapped and can complete in hardware in one computer cycle. (A small sketch follows this list.)
    • Superscalar arithmetic
      • is the ability to issue several arithmetic operations per computer cycle.
      • It makes use of the multiple, independent execution units. On superscalar computers there are multiple slots per cycle that can be filled with work. This gives rise to the name n-way superscalar, where n is the number of slots per cycle. The SGI Origin2000 is called a 4-way superscalar computer.
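
    As a small illustration (mine, not the slide's): a loop whose body is a multiply followed by an add is the pattern that MADD/FMA hardware targets. Whether the compiler actually fuses the two operations depends on the compiler and its flags.

      ! Each iteration is one multiply plus one add, which FMA-capable
      ! hardware can execute as a single fused multiply-add.
      subroutine axpy(n, a, x, y)
        implicit none
        integer, intent(in) :: n
        real, intent(in)    :: a, x(n)
        real, intent(inout) :: y(n)
        integer :: i
        do i = 1, n
          y(i) = a * x(i) + y(i)
        end do
      end subroutine axpy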


    Memory Parallelism
    • Memory interleaving
      • Memory is divided into multiple banks, and consecutive data elements are interleaved among them. For example, if your computer has 2 memory banks, then data elements with even memory addresses would fall into one bank, and data elements with odd memory addresses into the other. (A small access-pattern sketch follows this list.)
    • Multiple memory ports
      • A port is a bi-directional memory pathway. When the data elements that are interleaved across the memory banks are needed, the multiple memory ports allow them to be accessed and fetched in parallel, which increases the memory bandwidth (MB/s or GB/s).
    • Multiple levels of the memory hierarchy
      • There is global memory that any processor can access. There is memory that is local to a partition of the processors. Finally, there is memory that is local to a single processor, that is, the cache memory and the memory elements held in registers.
    • Cache memory
      • Cache is a small memory that has fast access compared with the larger main memory and serves to keep the faster processor filled with data.
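
    A small sketch (my own, not from the slides) of the access pattern that interleaved banks and cache lines reward: in Fortran's column-major storage, making the innermost loop run over the first index touches consecutive memory locations.

      subroutine scale_matrix(n, a, s)
        implicit none
        integer, intent(in) :: n
        real, intent(inout) :: a(n, n)
        real, intent(in)    :: s
        integer :: i, j
        do j = 1, n        ! outer loop over columns
          do i = 1, n      ! inner loop is unit stride in column-major storage
            a(i, j) = s * a(i, j)
          end do
        end do
      end subroutine scale_matrix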


    Memory Parallelism
    [Figures: Memory Hierarchy; Cache Memory]


    Disk Parallelism
    • RAID (Redundant Array of Inexpensive Disks)
      • RAID disks are on most parallel computers.
      • The advantage of a RAID disk system is that it provides a measure of fault tolerance.
      • If one of the disks goes down, it can be swapped out, and the RAID disk system remains operational.
    • Disk striping
      • When a data set is written to disk, it is striped across the RAID disk system. That is, it is broken into pieces that are written simultaneously to the different disks in the RAID disk system. When the same data set is read back in, the pieces are read in parallel, and the full data set is reassembled in memory.


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
        1.1.1 Parallelism in our Daily Lives
        1.1.2 Parallelism in Computer Programs
        1.1.3 Parallelism in Computers
          1.1.3.4 Disk Parallelism
        1.1.4 Performance Measures
        1.1.5 More Parallelism Issues
      1.2 Comparison of Parallel Computers
      1.3 Summary


    Performance Measures
    • Peak performance
      • is the top speed at which the computer can operate.
      • It is a theoretical upper limit on the computer's performance.
    • Sustained performance
      • is the highest consistently achieved speed.
      • It is a more realistic measure of computer performance.
    • Cost performance
      • is used to determine if the computer is cost effective.
    • MHz
      • is a measure of the processor speed.
      • The processor speed is commonly measured in millions of cycles per second, where a computer cycle is defined as the shortest time in which some work can be done.
    • MIPS
      • is a measure of how quickly the computer can issue instructions.
      • Millions of instructions per second is abbreviated as MIPS, where the instructions are computer instructions such as memory reads and writes, logical operations, floating point operations, integer operations, and branch instructions.


    Performance Measures
    • Mflops (millions of floating point operations per second)
      • measures how quickly a computer can perform floating-point operations such as add, subtract, multiply, and divide.
    • Speedup
      • measures the benefit of parallelism.
      • It shows how your program scales as you compute with more processors, compared to the performance on one processor.
      • Ideal speedup happens when the performance gain is linearly proportional to the number of processors used.
    • Benchmarks
      • are used to rate the performance of parallel computers and parallel programs.
      • A well known benchmark that is used to compare parallel computers is the Linpack benchmark.
      • Based on the Linpack results, a list is produced of the Top 500 Supercomputer Sites. This list is maintained by the University of Tennessee and the University of Mannheim.


    More Parallelism Issues
    • Load balancing
      • is the technique of evenly dividing the workload among the processors.
      • For data parallelism it involves how iterations of loops are allocated to processors.
      • Load balancing is important because the total time for the program to complete is the time spent by the longest executing thread.
    • The problem size
      • must be large and must be able to grow as you compute with more processors.
      • In order to get the performance you expect from a parallel computer you need to run a large application with large data sizes, otherwise the overhead of passing information between processors will dominate the calculation time.
    • Good software tools
      • are essential for users of high performance parallel computers.
      • These tools include:
        • parallel compilers
        • parallel debuggers
        • performance analysis tools
        • parallel math software
      • The availability of a broad set of application software is also important.


    More Parallelism Issues
    • The high performance computing market is risky and chaotic. Many supercomputer vendors are no longer in business, making the portability of your application very important.
    • A workstation farm
      • is defined as a fast network connecting heterogeneous workstations.
      • The individual workstations serve as desktop systems for their owners.
      • When they are idle, large problems can take advantage of the unused cycles in the whole system.
      • An application of this concept is the SETI project. You can participate in searching for extraterrestrial intelligence with your home PC. More information about this project is available at the SETI Institute.
    • Condor
      • is software that provides resource management services for applications that run on heterogeneous collections of workstations.
      • Miron Livny at the University of Wisconsin at Madison is the director of the Condor project, and has coined the phrase "high throughput computing" to describe this process of harnessing idle workstation cycles. More information is available at the Condor Home Page.


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
      1.2 Comparison of Parallel Computers
        1.2.1 Processors
        1.2.2 Memory Organization
        1.2.3 Flow of Control
        1.2.4 Interconnection Networks
          1.2.4.1 Bus Network
          1.2.4.2 Cross-Bar Switch Network
          1.2.4.3 Hypercube Network
          1.2.4.4 Tree Network
          1.2.4.5 Interconnection Networks Self-test
        1.2.5 Summary of Parallel Computer Characteristics
      1.3 Summary


    Comparison of Parallel Computers
    • Now you can explore the hardware components of parallel computers:
      • kinds of processors
      • types of memory organization
      • flow of control
      • interconnection networks
    • You will see what is common to these parallel computers, and what makes each one of them unique.


    Kinds of Processors
    • There are three types of parallel computers:
      1. Computers with a small number of powerful processors
        • Typically have tens of processors.
        • The cooling of these computers often requires very sophisticated and expensive equipment, making these computers very expensive for computing centers.
        • They are general-purpose computers that perform especially well on applications that have large vector lengths.
        • Examples of this type of computer are the Cray SV1 and the Fujitsu VPP5000.


    Kinds of Processors
    • There are three types of parallel computers:
      2. Computers with a large number of less powerful processors
        • Named a Massively Parallel Processor (MPP), these typically have thousands of processors.
        • The processors are usually proprietary and air-cooled.
        • Because of the large number of processors, the distance between the furthest processors can be quite large, requiring a sophisticated internal network that allows distant processors to communicate with each other quickly.
        • These computers are suitable for applications with a high degree of concurrency.
        • The MPP type of computer was popular in the 1980s.
        • Examples of this type of computer were the Thinking Machines CM-2 computer, and the computers made by the MassPar company.


    Kinds of Processors
    • There are three types of parallel computers:
      3. Computers that are medium scale, in between the two extremes
        • Typically have hundreds of processors.
        • The processor chips are usually not proprietary; rather, they are commodity processors like the Pentium III.
        • These are general-purpose computers that perform well on a wide range of applications.
        • The most common example of this class is the Linux Cluster.


    Trends and Examples
    • Processor trends:

    Decade   Processor Type                    Computer Example
    1970s    Pipelined, Proprietary            Cray-1
    1980s    Massively Parallel, Proprietary   Thinking Machines CM2
    1990s    Superscalar, RISC, Commodity      SGI Origin2000
    2000s    CISC, Commodity                   Workstation Clusters

    • The processors on today's commonly used parallel computers:

    Computer               Processor
    SGI Origin2000         MIPS RISC R12000
    HP V-Class             HP PA 8200
    Cray T3E               Compaq Alpha
    IBM SP                 IBM Power3
    Workstation Clusters   Intel Pentium III, Intel Itanium


    Memory Organization
    • The following paragraphs describe the three types of memory organization found on parallel computers:
      • distributed memory
      • shared memory
      • distributed shared memory


    Distributed Memory
    • In distributed memory computers, the total memory is partitioned into memory that is private to each processor.
    • There is a Non-Uniform Memory Access time (NUMA), which is proportional to the distance between the two communicating processors.
    • On NUMA computers, data is accessed the quickest from a private memory, while data from the most distant processor takes the longest to access.
    • Some examples are the Cray T3E, the IBM SP, and workstation clusters.


    Distributed Memory
    • When programming distributed memory computers, the code and the data should be structured such that the bulk of a processor's data accesses are to its own private (local) memory.
      • This is called having good data locality.
    • Today's distributed memory computers use message passing, such as MPI, to communicate between processors.


    Distributed Memory
    • One advantage of distributed memory computers is that they are easy to scale. As the demand for resources grows, computer centers can easily add more memory and processors.
      • This is often called the LEGO block approach.
    • The drawback is that programming of distributed memory computers can be quite complicated.


    Shared Memory
    • In shared memory computers, all processors have access to a single pool of centralized memory with a uniform address space.
    • Any processor can address any memory location at the same speed, so there is Uniform Memory Access time (UMA).
    • Processors communicate with each other through the shared memory.
    • The advantages and disadvantages of shared memory machines are roughly the opposite of distributed memory computers.
      • They are easier to program because they resemble the programming of single processor machines.
      • But they don't scale like their distributed memory counterparts.


    Distributed Shared Memory
    • In Distributed Shared Memory (DSM) computers, a cluster or partition of processors has access to a common shared memory.
    • It accesses the memory of a different processor cluster in a NUMA fashion.
    • Memory is physically distributed but logically shared.
    • Attention to data locality is again important.
    • Distributed shared memory computers combine the best features of both distributed memory computers and shared memory computers.
      • That is, DSM computers have both the scalability of distributed memory computers and the ease of programming of shared memory computers.
    • Some examples of DSM computers are the SGI Origin2000 and the HP V-Class computers.


    Trends and Examples
    • Memory organization trends:

    Decade   Memory Organization         Example
    1970s    Shared Memory               Cray-1
    1980s    Distributed Memory          Thinking Machines CM-2
    1990s    Distributed Shared Memory   SGI Origin2000
    2000s    Distributed Memory          Workstation Clusters

    • The memory organization of today's commonly used parallel computers:

    Computer               Memory Organization
    SGI Origin2000         DSM
    HP V-Class             DSM
    Cray T3E               Distributed
    IBM SP                 Distributed
    Workstation Clusters   Distributed


    Flow of Control
    • When you look at the flow of control you will see three types of parallel computers:
      • Single Instruction Multiple Data (SIMD)
      • Multiple Instruction Multiple Data (MIMD)
      • Single Program Multiple Data (SPMD)


    Flynn's Taxonomy
    • Flynn's Taxonomy, devised in 1972 by Michael Flynn of Stanford University, describes computers by how streams of instructions interact with streams of data.
    • There can be single or multiple instruction streams, and there can be single or multiple data streams. This gives rise to 4 types of computers.
    • Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD and MIMD.
    • Of these 4, only SIMD and MIMD are applicable to parallel computers.
    • Another computer type, SPMD, is a special case of MIMD.


    SIMD Computers
    • SIMD stands for Single Instruction Multiple Data.
    • Each processor follows the same set of instructions, with different data elements being allocated to each processor.
    • SIMD computers have distributed memory with typically thousands of simple processors, and the processors run in lock step.
    • SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications, such as neural networks.
    • Some examples of SIMD computers were the Thinking Machines CM-2 computer and the computers from the MassPar company.
    • The processors are commanded by the global controller that sends instructions to the processors.
      • It says "add", and they all add.
      • It says "shift to the right", and they all shift to the right.
      • The processors are like obedient soldiers, marching in unison.


    MIMD Computers
    • MIMD stands for Multiple Instruction Multiple Data.
    • There are multiple instruction streams, with separate code segments distributed among the processors.
    • MIMD is actually a superset of SIMD, so the processors can run the same instruction stream or different instruction streams.
    • In addition, there are multiple data streams; different data elements are allocated to each processor.
    • MIMD computers can have either distributed memory or shared memory.
    • While the processors on SIMD computers run in lock step, the processors on MIMD computers run independently of each other.
    • MIMD computers can be used for either data parallel or task parallel applications.
    • Some examples of MIMD computers are the SGI Origin2000 computer and the HP V-Class computer.


    SPMD Computers
    • SPMD stands for Single Program Multiple Data.
    • SPMD is a special case of MIMD.
    • SPMD execution happens when a MIMD computer is programmed to have the same set of instructions per processor.
    • With SPMD computers, while the processors are running the same code segment, each processor can run that code segment asynchronously.
    • Unlike SIMD, the synchronous execution of instructions is relaxed.
    • An example is the execution of an if statement on an SPMD computer.
      • Because each processor computes with its own partition of the data elements, it may evaluate the right hand side of the if statement differently from another processor.
      • One processor may take a certain branch of the if statement, and another processor may take a different branch of the same if statement.
      • Hence, even though each processor has the same set of instructions, those instructions may be evaluated in a different order from one processor to the next.
    • The analogies we used for describing SIMD computers can be modified for MIMD computers.
      • Instead of the SIMD obedient soldiers, all marching in unison, in the MIMD world the processors march to the beat of their own drummer.


    Summary of SIMD versus MIMD

                      SIMD                      MIMD
    Memory            distributed memory        distributed or shared memory
    Code Segment      same per processor        same or different
    Processors Run    in lock step              asynchronously
    Data Elements     different per processor   different per processor
    Applications      data parallel             data parallel or task parallel


    Trends and Examples
    • Flow of control trends:

    Decade   Flow of Control   Computer Example
    1980s    SIMD              Thinking Machines CM-2
    1990s    MIMD              SGI Origin2000
    2000s    MIMD              Workstation Clusters

    • The flow of control on today's commonly used parallel computers:

    Computer               Flow of Control
    SGI Origin2000         MIMD
    HP V-Class             MIMD
    Cray T3E               MIMD
    IBM SP                 MIMD
    Workstation Clusters   MIMD


    Agenda
    1 Parallel Computing Overview
      1.1 Introduction to Parallel Computing
      1.2 Comparison of Parallel Computers
        1.2.1 Processors
        1.2.2 Memory Organization
        1.2.3 Flow of Control
        1.2.4 Interconnection Networks
          1.2.4.1 Bus Network
          1.2.4.2 Cross-Bar Switch Network
          1.2.4.3 Hypercube Network
          1.2.4.4 Tree Network
          1.2.4.5 Interconnection Networks Self-test
        1.2.5 Summary of Parallel Computer Characteristics
      1.3 Summary


    Interconnection Networks
    • What exactly is the interconnection network?
      • The interconnection network is made up of the wires and cables that define how the multiple processors of a parallel computer are connected to each other and to the memory units.
      • The time required to transfer data is dependent upon the specific type of the interconnection network.
      • This transfer time is called the communication time.
    • What network characteristics are important?
      • Diameter: the maximum distance that data must travel for 2 processors to communicate.
      • Bandwidth: the amount of data that can be sent through a network connection.
      • Latency: the delay on a network while a data packet is being stored and forwarded.
    • Types of Interconnection Networks
      • The network topologies (geometric arrangements of the computer network connections) are:
        • Bus
        • Cross-bar Switch
        • Hypercube
        • Tree


    Interconnection Networks
    • The aspects of network issues are:
      • Cost
      • Scalability
      • Reliability
      • Suitable Applications
      • Data Rate
      • Diameter
      • Degree
    • General network characteristics
      • Some networks can be compared in terms of their degree and diameter.
      • Degree: how many communicating wires are coming out of each processor.
        • A large degree is a benefit because it provides multiple paths.
      • Diameter: the distance between the two processors that are farthest apart.
        • A small diameter corresponds to low latency.


    Bus Network
    • Bus topology is the original coaxial cable-based Local Area Network (LAN) topology in which the medium forms a single bus to which all stations are attached.
    • The positive aspects
      • It is a mature technology that is well known and reliable.
      • The cost is very low.
      • It is simple to construct.
    • The negative aspects
      • limited data transmission rate
      • not scalable in terms of performance
    • Example: SGI Power Challenge
      • Only scaled to 18 processors.


    Cross-Bar Switch Network
    • A cross-bar switch is a network that works through a switching mechanism to access shared memory.
    • It scales better than the bus network, but it costs significantly more.
    • The telephone system uses this type of network. An example of a computer with this type of network is the HP V-Class.
    • A diagram of a cross-bar switch network shows the processors talking through the switchboxes to store or retrieve data in memory.
    • There are multiple paths for a processor to communicate with a certain memory.
    • The switches determine the optimal route to take.


    Hypercube Network
    • In a hypercube network, the processors are connected as if they were corners of a multidimensional cube. Each node in an N-dimensional cube is directly connected to N other nodes.
    • The fact that the number of directly connected, "nearest neighbor" nodes increases with the total size of the network is also highly desirable for a parallel computer.
    • The degree of a hypercube network is log n and the diameter is log n, where n is the number of processors. (A worked example follows.)
    • Examples of computers with this type of network are the CM-2, NCUBE-2, and the Intel iPSC860.


    Tree Network
    • The processors are the bottom nodes of the tree. For a processor to retrieve data, it must go up in the network and then go back down.
    • This is useful for decision making applications that can be mapped as trees.
    • The degree of a tree network is 1. The diameter of the network is 2 log(n+1) - 2, where n is the number of processors.
    • The Thinking Machines CM-5 is an example of a parallel computer with this type of network.
    • Tree networks are very suitable for database applications because they allow multiple searches through the database at a time.


    Interconnection Networks
    • Torus Network: a mesh with wrap-around connections in both the x and y directions.
    • Multistage Network: a network with more than one networking unit.
    • Fully Connected Network: a network where every processor is connected to every other processor.
    • Hypercube Network: processors are connected as if they were corners of a multidimensional cube.
    • Mesh Network: a network where each interior processor is connected to its four nearest neighbors.


    Interconnection Networks
    • Bus Based Network: coaxial cable-based LAN topology in which the medium forms a single bus to which all stations are attached.
    • Cross-bar Switch Network: a network that works through a switching mechanism to access shared memory.
    • Tree Network: the processors are the bottom nodes of the tree.
    • Ring Network: each processor is connected to two others, and the line of connections forms a circle.


    Summary of Parallel Computer Characteristics
    • How many processors does the computer have?
      • 10s?
      • 100s?
      • 1000s?
    • How powerful are the processors?
      • What's the MHz rate?
      • What's the MIPS rate?
    • What's the instruction set architecture?
      • RISC
      • CISC


    Summary of Parallel Computer Characteristics
    • How much memory is available?
      • total memory
      • memory per processor
    • What kind of memory?
      • distributed memory
      • shared memory
      • distributed shared memory
    • What type of flow of control?
      • SIMD
      • MIMD
      • SPMD


    Summary of Parallel Computer Characteristics
    • What is the interconnection network?
      • Bus
      • Crossbar
      • Hypercube
      • Tree
      • Torus
      • Multistage
      • Fully Connected
      • Mesh
      • Ring
      • Hybrid


    Design decisions made by some of the major parallel computer vendors

    Computer               Programming Style   OS       Processors          Memory        Flow of Control   Network
    SGI Origin2000         OpenMP, MPI         IRIX     MIPS RISC R10000    DSM           MIMD              Crossbar, Hypercube
    HP V-Class             OpenMP, MPI         HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring
    Cray T3E               SHMEM               Unicos   Compaq Alpha        Distributed   MIMD              Torus
    IBM SP                 MPI                 AIX      IBM Power3          Distributed   MIMD              IBM Switch
    Workstation Clusters   MPI                 Linux    Intel Pentium III   Distributed   MIMD              Myrinet, Tree


    Summary
    • This completes our introduction to parallel computing.
    • You have learned about parallelism in computer programs, and also about parallelism in the hardware components of parallel computers.
    • In addition, you have learned about the commonly used parallel computers, and how these computers compare to each other.
    • There are many good texts which provide an introductory treatment of parallel computing. Here are two useful references:
      • Highly Parallel Computing, Second Edition. George S. Almasi and Allan Gottlieb. Benjamin/Cummings Publishers, 1994.
      • Parallel Computing: Theory and Practice. Michael J. Quinn. McGraw-Hill, Inc., 1994.


    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
      2.1 Automatic Compiler Parallelism
      2.2 Data Parallelism by Hand
      2.3 Mixing Automatic and Hand Parallelism
      2.4 Task Parallelism
      2.5 Parallelism Issues
    3 Porting Issues
    4 Scalar Tuning
    5 Parallel Code Tuning
    6 Timing and Profiling
    7 Cache Tuning
    8 Parallel Performance Analysis
    9 About the IBM Regatta P690


    How to Parallelize a Code
    • This chapter describes how to turn a single processor program into a parallel one, focusing on shared memory machines.
    • Both automatic compiler parallelization and parallelization by hand are covered.
    • The details for accomplishing both data parallelism and task parallelism are presented.


    Automatic Compiler Parallelism
    • Automatic compiler parallelism enables you to use a single compiler option and let the compiler do the work.
    • The advantage is that it's easy to use.
    • The disadvantages are:
      • The compiler only does loop level parallelism, not task parallelism.
      • The compiler wants to parallelize every do loop in your code. If you have hundreds of do loops, this creates way too much parallel overhead.


    Automatic Compiler Parallelism
    • To use automatic compiler parallelism on a Linux system with the Intel compilers, specify the following:

      ifort -parallel -O2 ... prog.f

    • The compiler creates conditional code that will run with any number of threads.
    • Specify the number of threads with setenv, and make sure you still get the right answers:

      setenv OMP_NUM_THREADS 4
      a.out > results


    Data Parallelism by Hand
    • First identify the loops that use most of the CPU time (the Profiling lecture describes how to do this).
    • By hand, insert into the code OpenMP directive(s) just before the loop(s) you want to make parallel.
    • Some code modifications may be needed to remove data dependencies and other inhibitors of parallelism.
    • Use your knowledge of the code and data to assist the compiler.
    • For the SGI Origin2000 computer, insert into the code an OpenMP directive just before the loop that you want to make parallel:

      !$OMP PARALLEL DO
      do i = 1, n
        lots of computation ...
      end do
      !$OMP END PARALLEL DO


    Data Parallelism by Hand
    • Compile with the -mp compiler option:

      f90 -mp ... prog.f

    • As before, the compiler generates conditional code that will run with any number of threads.
    • If you want to rerun your program with a different number of threads, you do not need to recompile; just re-specify the setenv command:

      setenv OMP_NUM_THREADS 8
      a.out > results2

    • The setenv command can be placed anywhere before the a.out command.
    • The setenv command must be typed exactly as indicated. If you have a typo, you will not receive a warning or error message. To make sure that the setenv command is specified correctly, type:

      setenv

      This produces a listing of your environment variable settings.


    Mixing Automatic and Hand Parallelism
    • You can have one source file parallelized automatically by the compiler, and another source file parallelized by hand. Suppose you split your code into two files named prog1.f and prog2.f:

      f90 -c -apo prog1.f     (automatic parallelization of prog1.f)
      f90 -c -mp  prog2.f     (hand parallelization of prog2.f)
      f90 prog1.o prog2.o     (creates one executable)
      a.out > results         (runs the executable)


    Task Parallelism
    • You can accomplish task parallelism as follows:

      !$OMP PARALLEL
      !$OMP SECTIONS
        lots of computation in part A ...
      !$OMP SECTION
        lots of computation in part B ...
      !$OMP SECTION
        lots of computation in part C ...
      !$OMP END SECTIONS
      !$OMP END PARALLEL

    • Compile with the -mp compiler option:

      f90 -mp prog.f

    • Use the setenv command to specify the number of threads:

      setenv OMP_NUM_THREADS 3
      a.out > results


    Parallelism Issues
    • There are some issues to consider when parallelizing a program:
      • Should data parallelism or task parallelism be used?
      • Should automatic compiler parallelism or parallelism by hand be used?
      • Which loop in a nested loop situation should be the one that becomes parallel? (See the sketch after this list.)
      • How many threads should be used?
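
    A minimal sketch of the nested-loop question (my own illustration, not from the slides): other things being equal, parallelizing the outermost independent loop pays the thread start-up and fork/join cost once and gives each thread a large chunk of work, whereas parallelizing the inner loop would pay that overhead on every outer iteration.

      subroutine add_arrays(n, a, b, c)
        implicit none
        integer, intent(in) :: n
        real, intent(in)    :: a(n, n), b(n, n)
        real, intent(out)   :: c(n, n)
        integer :: i, j
      ! Parallelize the outer J loop; the inner I loop stays sequential
      ! inside each thread.
      !$OMP PARALLEL DO PRIVATE(i)
        do j = 1, n
          do i = 1, n
            c(i, j) = a(i, j) + b(i, j)
          end do
        end do
      !$OMP END PARALLEL DO
      end subroutine add_arrays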


    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
    3 Porting Issues
      3.1 Recompile
      3.2 Word Length
      3.3 Compiler Options for Debugging
      3.4 Standards Violations
      3.5 IEEE Arithmetic Differences
      3.6 Math Library Differences
      3.7 Compute Order Related Differences
      3.8 Optimization Level Too High
      3.9 Diagnostic Listings
      3.10 Further Information


    Recompile
    • Some codes just need to be recompiled to get accurate results.
    • The compilers available on the NCSA computer platforms are shown in the following table:

    Language                   SGI Origin2000 (MIPSpro)   IA-32 Linux (Portland Group / Intel / GNU)   IA-64 Linux (Portland Group / Intel / GNU)
    Fortran 77                 f77                        ifort / g77                                  pgf77 / ifort / g77
    Fortran 90                 f90                        ifort                                        pgf90 / ifort
    Fortran 95                 f95                        ifort                                        ifort
    High Performance Fortran                              pghpf                                        pghpf
    C                          cc                         icc / gcc                                    pgcc / icc / gcc
    C++                        CC                         icpc / g++                                   pgCC / icpc / g++


    Word Length
    • Code flaws can occur when you are porting your code to a different word length computer.
      • For C, the size of an integer variable differs depending on the machine and how the variable is generated. On the IA-32 and IA-64 Linux clusters, the size of an integer variable is 4 and 8 bytes, respectively. On the SGI Origin2000, the corresponding value is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag.
      • For Fortran, the SGI MIPSpro and Intel compilers contain the following flags to set the default variable size:
        • -in, where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
        • -rn, where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.


    Compiler Options for Debugging
    • On the SGI Origin2000, the MIPSpro compilers include debugging options via the -DEBUG: group. The syntax is as follows:

      -DEBUG:option1[=value1]:option2[=value2]...

    • Two examples are:
      • Array-bound checking: check for subscripts out of range at runtime.

        -DEBUG:subscript_check=ON

      • Force all un-initialized stack, automatic, and dynamically allocated variables to be initialized.

        -DEBUG:trap_uninitialized=ON


    Compiler Options for Debugging
    • On the IA-32 Linux cluster, the Fortran compiler is equipped with the following -C flags for runtime diagnostics:
      • -CA: pointers and allocatable references
      • -CB: array and subscript bounds
      • -CS: consistent shape of intrinsic procedure
      • -CU: use of uninitialized variables
      • -CV: correspondence between dummy and actual arguments


    Standards Violations
    • Code flaws can occur when the program has non-ANSI standard Fortran coding.
      • ANSI standard Fortran is a set of rules for compiler writers that specify, for example, the value of the do loop index upon exit from the do loop. (A small example follows.)
    • Standards violations detection
      • To detect standards violations on the SGI Origin2000 computer, use the -ansi flag.
      • This option generates a listing of warning messages for the use of non-ANSI standard coding.
      • On the Linux clusters, the -ansi[-] flag enables/disables assumption of ANSI conformance.
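
    A small illustration (mine, not the slide's) of the do-loop-index point: standard Fortran defines the index after normal completion as the first value that fails the loop test, so legacy code written against older, non-standard behavior can silently give different results on a new platform.

      subroutine last_index(n)
        implicit none
        integer, intent(in) :: n
        integer :: i
        do i = 1, n
          ! ... work ...
        end do
        ! Non-portable assumption: some legacy code expects i == n here.
        ! A standard-conforming compiler leaves i == n + 1 after normal exit.
        print *, 'loop index after exit =', i
      end subroutine last_index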


    IEEE Arithmetic Differences
    • Code flaws occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not.
      • The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior.
      • For example, it prohibits the compiler writer from replacing x/y with x*recip(y), since the two results may differ slightly for some operands. You can make your program strictly conform to the IEEE standard.
    • To make your program conform to the IEEE Arithmetic Standard on the SGI Origin2000 computer, use:

      f90 -OPT:IEEE_arithmetic=n ... prog.f

      where n is 1, 2, or 3.
      • This option specifies the level of conformance to the IEEE standard, where 1 is the most stringent and 3 is the most liberal.
    • On the Linux clusters, the Intel compilers can achieve conformance to the IEEE standard at a stringent level with the -mp flag, or at a slightly relaxed level with the -mp1 flag.


    Math Library Differences
    • Most high-performance parallel computers are equipped with vendor-supplied math libraries.
    • On the SGI Origin2000 platform, there are the SGI/Cray Scientific Library (SCSL) and Complib.sgimath.
      • SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK, and Fast Fourier Transform (FFT) routines.
      • SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version.
      • The complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version.
    • The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.


    Math Library Differences
    • On the IA-32 Linux cluster, the libraries to link to are:
      • For BLAS:    -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
      • For LAPACK:  -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
      • When calling MKL routines from C/C++ programs, you also need to link with -lF90.
    • On the IA-64 Linux cluster, the corresponding libraries are:
      • For BLAS:    -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
      • For LAPACK:  -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
      • When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins.


    Compute Order Related Differences
    • Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The compute order in which the threads will run cannot be guaranteed.
      • For example, in a data parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program.
      • Note: If your algorithm depends on data being compared in a specific order, your code is inappropriate for a parallel computer.
    • Use the following method to detect compute order related differences:
      • If your loop looks like
        DO I = 1, N
        change it to
        DO I = N, 1, -1
      • The results should not change if the iterations are independent. (A short example follows.)
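
    A small sketch (my own, not from the slides) of why order can matter: floating-point addition is not associative, so a sum accumulated in a different order can differ in its low-order bits, which is exactly the kind of difference the reversed-loop test exposes.

      program order_test
        implicit none
        integer, parameter :: n = 1000000
        real :: x(n), fwd, bwd
        integer :: i
        do i = 1, n
          x(i) = 1.0 / real(i)     ! values of widely varying magnitude
        end do
        fwd = 0.0
        do i = 1, n                ! forward order
          fwd = fwd + x(i)
        end do
        bwd = 0.0
        do i = n, 1, -1            ! reverse order
          bwd = bwd + x(i)
        end do
        ! The two sums typically differ slightly.
        print *, 'forward =', fwd, ' reverse =', bwd
      end program order_test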


    Optimization Level Too High
    • Code flaws can occur when the optimization level has been set too high, thus trading speed for accuracy.
      • The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at higher optimization levels.
    • Setting the optimization level
      • Both the SGI Origin2000 computer and the IBM Linux clusters provide Level 0 (no optimization) to Level 3 (most aggressive) optimization, using the -O{0,1,2,3} flag. One should bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking the correctness and precision of the calculation is highly recommended when -O3 is used.
      • For example, on the Origin2000,
        f90 -O0 prog.f
        turns off all optimizations.


    Diagnostic Listings
    • The SGI Origin2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:

      f90 -listing ...
      f90 -fullwarn ...
      f90 -showdefaults ...
      f90 -version ...
      f90 -help ...


    Further Information
    • SGI
      • man f77/f90/cc
      • man debug_group
      • man math
      • man complib.sgimath
      • MIPSpro 64-Bit Porting and Transition Guide
      • Online Manuals
    • Linux clusters pages
      • ifort/icc/icpc help (IA32, IA64, Intel64)
      • Intel Fortran Compiler for Linux
      • Intel C/C++ Compiler for Linux


    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
    3 Porting Issues
    4 Scalar Tuning
      4.1 Aggressive Compiler Options
      4.2 Compiler Optimizations
      4.3 Vendor Tuned Code
      4.4 Further Information


    Scalar Tuning
    • If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime.
    • This chapter describes many of these techniques:
      • the use of the most aggressive compiler options
      • the improvement of loop unrolling
      • the use of subroutine inlining
      • the use of vendor supplied tuned code
    • The detection of cache problems, and their solution, are presented in the Cache Tuning chapter.


    Aggressive Compiler Options
    • It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes.
    • It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization.
    • On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n = 1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to the IEEE standard at different levels.
    • On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.


    Agenda
    1 Parallel Computing Overview
    2 How to Parallelize a Code
    3 Porting Issues
    4 Scalar Tuning
      4.1 Aggressive Compiler Options
      4.2 Compiler Optimizations
        4.2.1 Statement Level
        4.2.2 Block Level
        4.2.3 Routine Level
        4.2.4 Software Pipelining
        4.2.5 Loop Unrolling
        4.2.6 Subroutine Inlining
        4.2.7 Optimization Report
        4.2.8 Profile-guided Optimization (PGO)
      4.3 Vendor Tuned Code
      4.4 Further Information


    Compiler Optimizations
    • The various compiler optimizations can be classified as follows:
      • Statement Level Optimizations
      • Block Level Optimizations
      • Routine Level Optimizations
      • Software Pipelining
      • Loop Unrolling
      • Subroutine Inlining
    • Each of these is described in the following sections.


    Statement Level
    • Constant folding
      • Replace simple arithmetic operations on constants with the pre-computed result.
      • y = 5 + 7 becomes y = 12
    • Short circuiting
      • Avoid executing parts of conditional tests that are not necessary.
      • In if (I.eq.J .or. I.eq.K) expression, when I = J the expression is computed immediately.
    • Register assignment
      • Put frequently used variables in registers.


    Block Level
    • Dead code elimination
      • Remove unreachable code and code that is never executed or used.
    • Instruction scheduling
      • Reorder the instructions to improve memory pipelining.


    Routine Level
    • Strength reduction
      • Replace expressions in a loop with an expression that takes fewer cycles.
    • Common subexpression elimination
      • Expressions that appear more than once are computed once, and the result is substituted for each occurrence of the expression.
    • Constant propagation
      • Compile time replacement of variables with constants.
    • Loop invariant elimination
      • Expressions inside a loop that don't change with the do loop index are moved outside the loop. (A small sketch follows this list.)
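
    A minimal sketch (my own illustration, not from the slides) of two of these transformations written out by hand; an optimizing compiler performs them automatically.

      subroutine tune_examples(n, x, scale, a, b)
        implicit none
        integer, intent(in) :: n
        real, intent(in)    :: x, scale
        real, intent(inout) :: a(n), b(4*n)
        real    :: r
        integer :: i, k

        ! Loop invariant elimination: x/scale does not depend on i,
        ! so it is computed once outside the loop instead of every pass.
        r = x / scale
        do i = 1, n
          a(i) = a(i) * r          ! was: a(i) = a(i) * (x / scale)
        end do

        ! Strength reduction: the multiply k = 4*i is replaced by an
        ! addition that is updated each iteration.
        k = 0
        do i = 1, n
          k = k + 4                ! was: k = 4 * i
          b(k) = 0.0
        end do
      end subroutine tune_examples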


    Software Pipelining
    • Software pipelining allows the mixing of operations from different loop iterations in each iteration of the hardware loop. It is used to get the maximum work done per clock cycle.
    • Note: On the R10000s there is out-of-order execution of instructions, and software pipelining may actually get in the way of this feature.


    Loop Unrolling
    • The loop's stride (or step) value is increased, and the body of the loop is replicated. It is used to improve the scheduling of the loop by giving a longer sequence of straight line code. An example of loop unrolling follows:

      Original Loop              Unrolled Loop
      do I = 1, 99               do I = 1, 99, 3
        c(I) = a(I) + b(I)         c(I)   = a(I)   + b(I)
      enddo                        c(I+1) = a(I+1) + b(I+1)
                                   c(I+2) = a(I+2) + b(I+2)
                                 enddo

    • There is a limit to the amount of unrolling that can take place because there are a limited number of registers.
    • On the SGI Origin2000, loops are unrolled to a level of 8 by default. You can unroll to a level of 12 by specifying:

      f90 -O3 -OPT:unroll_times_max=12 ... prog.f

    • On the IA-32 Linux cluster, the corresponding flags are -unroll and -unroll0, for unrolling and no unrolling, respectively.


Subroutine Inlining
Subroutine inlining replaces a call to a subroutine with the body of the subroutine itself.
One reason for using subroutine inlining is that when a subroutine is called inside a do loop that has a huge iteration count, inlining may be more efficient because it cuts down on loop overhead.
However, the chief reason for using it is that do loops that contain subroutine calls may not parallelize. A small sketch follows.
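A minimal, hypothetical illustration (the routine axpy1 and the arrays are invented): the call inside the first loop can inhibit parallelization and adds call overhead on every iteration, while the inlined version exposes a simple, analyzable loop body.

      ! original: call inside the loop (hypothetical)
      do i = 1, n
         call axpy1(y(i), a, x(i))
      enddo
      ...
      subroutine axpy1(yi, a, xi)
      real yi, a, xi
      yi = yi + a*xi
      end

      ! after inlining: the loop body is a single statement
      do i = 1, n
         y(i) = y(i) + a*x(i)
      enddo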


Subroutine Inlining
On the SGI Origin2000 computer, there are several options to invoke inlining:
  - Inline all routines except those specified to -INLINE:never
        f90 -O3 -INLINE:all prog.f
  - Inline no routines except those specified to -INLINE:must
        f90 -O3 -INLINE:none prog.f
  - Specify a list of routines to inline at every call
        f90 -O3 -INLINE:must=subrname prog.f
  - Specify a list of routines never to inline
        f90 -O3 -INLINE:never=subrname prog.f
On the Linux clusters, the following flags can invoke function inlining:
  - -ip  : inline function expansion for calls defined within the current source file
  - -ipo : inline function expansion for calls defined in separate files
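The slide gives only the flags for the Linux clusters; a typical invocation might look like the following (illustrative only, assuming the Intel Fortran compiler and source file names of your own):

    ifort -O3 -ip  -c prog.f         # inlining within prog.f only
    ifort -O3 -ipo -o prog a.f b.f   # inlining across a.f and b.f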


Optimization Report
Intel 9.x and later compilers can generate reports that provide useful information on optimization done on different parts of your code.
To generate such optimization reports in a file filename, add the flag -opt-report-file filename.
If you have a lot of source files to process simultaneously, and you use a makefile to compile, you can also use make's "suffix" rules to have optimization reports produced automatically, each with a unique name. For example,

    .f.o:
            ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f

creates optimization reports that are named identically to the original Fortran source but with the suffix ".f" replaced by ".opt".


Optimization Report
To help developers and performance analysts navigate through the usually lengthy optimization reports, the NCSA program OptView is designed to provide an easy-to-use and intuitive interface that allows the user to browse through their own source code, cross-referenced with the optimization reports.
OptView is installed on NCSA's IA64 Linux cluster under the directory /usr/apps/tools/bin. You can either add that directory to your UNIX PATH or you can invoke OptView using an absolute path name. You'll need to be using the X-Window system and to have set your DISPLAY environment variable correctly for OptView to work.
OptView can provide a quick overview of which loops in a source code, or among multiple source files, are highly optimized and which might need further work. For a detailed description of the use of OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/


Profile-guided Optimization (PGO)
Profile-guided optimization allows Intel compilers to use valuable runtime information to make better decisions about function inlining and interprocedural optimizations to generate faster code. Its methodology, shown as a diagram in the original slides (not reproduced here), follows the typical three-step workflow sketched below.
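A minimal sketch of the usual PGO workflow with the Intel compilers of that era, using the -prof-gen and -prof-use flags; the source and input file names are illustrative only:

    # Step 1: compile with profiling instrumentation
    ifort -prof-gen -o prog prog.f

    # Step 2: run on representative input; this writes profile data files
    ./prog < typical_input.dat

    # Step 3: recompile, letting the compiler use the collected profile
    ifort -prof-use -O3 -o prog prog.f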


Vendor Tuned Code
Vendor math libraries have codes that are optimized for their specific machine.
On the SGI Origin2000 platform, Complib.sgimath and SCSL are available.
On the Linux clusters, Intel MKL is available. Ways to link to these libraries are described in Section 3 - Porting Issues.


Further Information
SGI IRIX man and www pages
  - man opt
  - man lno
  - man inline
  - man ipa
  - man perfex
  - Performance Tuning for the Origin2000 at http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
Linux clusters help and www pages
  - ifort/icc/icpc help (Intel)
  - http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
  - http://perfsuite.ncsa.uiuc.edu/OptView/


Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
  5.1 Sequential Code Limitation
  5.2 Parallel Overhead
  5.3 Load Balance
    5.3.1 Loop Schedule Types
    5.3.2 Chunk Size


Parallel Code Tuning
This chapter describes several of the most common techniques for parallel tuning, the types of programs that benefit, and the details for implementing them.
The majority of this chapter deals with improving load balancing.


Sequential Code Limitation
Sequential code is a part of the program that cannot be run with multiple processors. Some reasons why it cannot be made data parallel are:
  - The code is not in a do loop.
  - The do loop contains a read or write.
  - The do loop contains a dependency.
  - The do loop has an ambiguous subscript.
  - The do loop has a call to a subroutine or a reference to a function subprogram.
Sequential Code Fraction
As shown by Amdahl's Law, if the sequential fraction is too large, there is a limitation on speedup. If you think too much sequential code is a problem, you can calculate the sequential fraction of code using the Amdahl's Law formula.


Sequential Code Limitation
Measuring the Sequential Code Fraction
  - Decide how many processors to use; this is p.
  - Run and time the program with 1 processor to give T(1).
  - Run and time the program with p processors to give T(p).
  - Form the ratio of the two timings, SP = T(1)/T(p).
  - Substitute SP and p into the Amdahl's Law formula:
        f = (1/SP - 1/p) / (1 - 1/p)
    where f is the fraction of sequential code.
  - Solve for f; this is the fraction of sequential code. (A small worked example follows below.)
Decreasing the Sequential Code Fraction
The compilation optimization reports list which loops could not be parallelized and why. You can use this report as a guide to improve performance on do loops by:
  - Removing dependencies
  - Removing I/O
  - Removing calls to subroutines and function subprograms
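Worked example (the timings are invented for illustration): suppose p = 4, T(1) = 120 seconds, and T(4) = 40 seconds, so SP = 120/40 = 3. Then

    f = (1/SP - 1/p) / (1 - 1/p)
      = (1/3 - 1/4) / (1 - 1/4)
      = 0.0833 / 0.75
      = 0.111

so roughly 11% of the work is effectively sequential, which by Amdahl's Law caps the achievable speedup at about 1/f, or roughly 9, no matter how many processors are used.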


Parallel Overhead
Parallel overhead is the processing time spent
  - creating threads
  - spin/blocking threads
  - starting and ending parallel regions
  - synchronizing at the end of parallel regions
When the computational work done by the parallel processes is too small, the overhead time needed to create and control the parallel processes can be disproportionately large, limiting the savings due to parallelism.
Measuring Parallel Overhead
To get a rough under-estimate of parallel overhead:
  - Run and time the code using 1 processor.
  - Parallelize the code.
  - Run and time the parallel code using only 1 processor.
  - Subtract the 2 timings.


Parallel Overhead
Reducing Parallel Overhead
To reduce parallel overhead:
  - Don't parallelize all the loops.
  - Don't parallelize small loops. To benefit from parallelization, a loop needs about 1000 floating point operations or 500 statements in the loop. You can use the IF modifier in the OpenMP directive to control when loops are parallelized:

        !$OMP PARALLEL DO IF(n > 500)
        do i = 1, n
           ... body of loop ...
        end do
        !$OMP END PARALLEL DO

  - Use task parallelism instead of data parallelism. It doesn't generate as much parallel overhead and often more code runs in parallel. (A short sketch follows this list.)
  - Don't use more threads than you need.
  - Parallelize at the highest level possible.
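The slides don't show a task parallelism example; as a hedged illustration, OpenMP sections let independent pieces of work run concurrently instead of splitting a single loop across threads. The two subroutines here are hypothetical and assumed to be independent of each other:

      !$OMP PARALLEL SECTIONS
      !$OMP SECTION
            call compute_forces(x, f, n)    ! hypothetical routine
      !$OMP SECTION
            call update_boundaries(b, m)    ! hypothetical routine
      !$OMP END PARALLEL SECTIONS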


Load Balance
Load balance is the even assignment of subtasks to processors so as to keep each processor busy doing useful work for as long as possible.
Load balance is important for speedup because the end of a do loop is a synchronization point where threads need to catch up with each other.
If processors have different work loads, some of the processors will idle while others are still working.
Measuring Load Balance
On the SGI Origin, to measure load balance, use the perfex tool, which is a command line interface to the R10000 hardware counters. The command

    perfex -e 16 -mp a.out > results

reports per-thread cycle counts. Compare the cycle counts to determine load balance problems. The master thread (thread 0) always uses more cycles than the slave threads. If the counts are vastly different, it indicates load imbalance.


Load Balance
For Linux systems, the thread CPU times can be compared with ps. A thread with unusually high or low time compared to the others may not be working efficiently (high CPU time could be the result of a thread spinning while waiting for other threads to catch up).

    ps uH

Improving Load Balance
To improve load balance, try changing the way that loop iterations are allocated to threads by
  - changing the loop schedule type
  - changing the chunk size
These methods are discussed in the following sections.


Loop Schedule Types
On the SGI Origin2000 computer, 4 different loop schedule types can be specified by an OpenMP directive. They are:
  - Static
  - Dynamic
  - Guided
  - Runtime
If you don't specify a schedule type, the default will be used.
Default Schedule Type
The default schedule type allocates 20 iterations on 4 threads as illustrated in the original slide's diagram (not reproduced here).


Loop Schedule Types
Static Schedule Type
The static schedule type is used when some of the iterations do more work than others. With the static schedule type, iterations are allocated in a round-robin fashion to the threads.
An Example
Suppose you are computing on the upper triangle of a 100 x 100 matrix, and you use 2 threads, named t0 and t1. With default scheduling, workloads are uneven.


Loop Schedule Types
Whereas with static scheduling, the columns of the matrix are given to the threads in a round-robin fashion, resulting in better load balance.


Loop Schedule Types
Dynamic Schedule Type
The iterations are dynamically allocated to threads at runtime. Each thread is given a chunk of iterations. When a thread finishes its work, it goes into a critical section where it's given another chunk of iterations to work on.
This type is useful when you don't know the iteration count or work pattern ahead of time. Dynamic gives good load balance, but at a high overhead cost.
Guided Schedule Type
The guided schedule type is dynamic scheduling that starts with large chunks of iterations and ends with small chunks of iterations. That is, the number of iterations given to each thread depends on the number of iterations remaining. The guided schedule type reduces the number of entries into the critical section, compared to the dynamic schedule type. Guided gives good load balancing at a low overhead cost.


Chunk Size
The word chunk refers to a grouping of iterations. Chunk size means how many iterations are in the grouping. The static and dynamic schedule types can be used with a chunk size. If a chunk size is not specified, then the chunk size is 1.
Suppose you specify a chunk size of 2 with the static schedule type. Then 20 iterations are allocated on 4 threads two at a time in round-robin order: iterations 1-2 go to thread 0, 3-4 to thread 1, 5-6 to thread 2, 7-8 to thread 3, then 9-10 back to thread 0, and so on.
The schedule type and chunk size are specified as follows:

    !$OMP PARALLEL DO SCHEDULE(type, chunk)
    !$OMP END PARALLEL DO

where type is STATIC, DYNAMIC, or GUIDED and chunk is any positive integer. A complete sketch follows.
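A minimal sketch of the directive in context (the loop body is hypothetical, and the arrays a, b, c are assumed to be declared elsewhere):

      !$OMP PARALLEL DO SCHEDULE(STATIC, 2)
      do i = 1, 20
         c(i) = a(i) + b(i)
      end do
      !$OMP END PARALLEL DO

With STATIC and a chunk size of 2, the 20 iterations are dealt out to the threads two at a time in round-robin order, as described above.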


Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information


Timing and Profiling
Now that your program has been ported to the new computer, you will want to know how fast it runs.
This chapter describes how to measure the speed of a program using various timing routines.
The chapter also covers how to determine which parts of the program account for the bulk of the computational load so that you can concentrate your tuning efforts on those computationally intensive parts of the program.


Timing
In the following sections, we'll discuss timers and review the profiling tools ssrun and prof on the Origin and vprof and gprof on the Linux Clusters. The specific timing functions described are:
  - Timing a section of code
      FORTRAN: etime, dtime, cpu_time for CPU time; time and f_time for wallclock time
      C: clock for CPU time; gettimeofday for wallclock time
  - Timing an executable
      time a.out
  - Timing a batch run
      busage, qstat, qhist


CPU Time
etime
A section of code can be timed using etime. It returns the elapsed CPU time in seconds since the program started.

    real*4 tarray(2), time1, time2, timeres
    ! beginning of program
    time1 = etime(tarray)
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    time2 = etime(tarray)
    timeres = time2 - time1


CPU Time
dtime
A section of code can also be timed using dtime. It returns the elapsed CPU time in seconds since the last call to dtime.

    real*4 tarray(2), timeres
    ! beginning of program
    timeres = dtime(tarray)
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    timeres = dtime(tarray)
    ! rest of program


CPU Time
The etime and dtime Functions
  - User time. This is returned as the first element of tarray. It's the CPU time spent executing user code.
  - System time. This is returned as the second element of tarray. It's the time spent executing system calls on behalf of your program.
  - Sum of user and system time. This is the function value that is returned. It's the time that is usually reported.
  - Metric. Timings are reported in seconds and are accurate to 1/100th of a second.


CPU Time
cpu_time
The cpu_time routine is available only on the Linux clusters as it is a component of the Intel FORTRAN compiler library. It provides substantially higher resolution and has substantially lower overhead than the older etime and dtime routines. It can be used as an elapsed timer.

    real*8 time1, time2, timeres
    ! beginning of program
    call cpu_time(time1)
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    call cpu_time(time2)
    timeres = time2 - time1
    ! rest of program


CPU Time
clock
For C programmers, one can call the cpu_time routine using a FORTRAN wrapper, or call the intrinsic function clock, which can be used to determine elapsed CPU time. (The header name below was stripped in extraction; clock and CLOCKS_PER_SEC come from the standard header time.h.)

    #include <time.h>

    static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
    double time1, time2, timeres;

    time1 = (clock()*iCPS);
    /* do some work */
    time2 = (clock()*iCPS);
    timeres = time2 - time1;


Wall clock Time
time
For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970. It is a means of getting the elapsed wall clock time. The wall clock time is reported in integer seconds.

    external time
    integer*4 time1, time2, timeres
    ! beginning of program
    time1 = time()
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    time2 = time()
    timeres = time2 - time1


Wall clock Time
f_time
For the Linux clusters, the appropriate FORTRAN function for elapsed time is f_time.

    integer*8 f_time
    external f_time
    integer*8 time1, time2, timeres
    ! beginning of program
    time1 = f_time()
    ! start of section of code to be timed
    ! ... lots of computation ...
    ! end of section of code to be timed
    time2 = f_time()
    timeres = time2 - time1

As above for etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib.


Wall clock Time
gettimeofday
For C programmers, wallclock time can be obtained by using the very portable routine gettimeofday. (The header names below were stripped in extraction; the comments indicate their purpose, and standard headers providing NULL and the timeval struct / gettimeofday prototype are used here.)

    #include <stddef.h>    /* definition of NULL */
    #include <sys/time.h>  /* definition of timeval struct and
                              prototyping of gettimeofday */
    double t1, t2, elapsed;
    struct timeval tp;
    int rtn;
    ....
    rtn = gettimeofday(&tp, NULL);
    t1 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
    ....
    /* do some work */
    ....
    rtn = gettimeofday(&tp, NULL);
    t2 = (double)tp.tv_sec + (1.e-6)*tp.tv_usec;
    elapsed = t2 - t1;


Timing an Executable
To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time):

    time options a.out

where options can be -p for a simple output, or -f format, which allows the user to display more than just time-related information.
Consult the man pages on the time command for format options.
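For instance, the POSIX -p option prints real (wallclock), user, and system time; the numbers below are illustrative only, not from an actual run in the slides:

    /usr/bin/time -p ./a.out
    real 12.48
    user 11.73
    sys 0.31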


Timing a Batch Job
Time of a batch job, running or completed.
  - Origin:
        busage jobid
  - Linux clusters:
        qstat jobid    # for a running job
        qhist jobid    # for a completed job


Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information


Profiling
Profiling determines where a program spends its time. It detects the computationally intensive parts of the code.
Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load.
Most codes follow the 90-10 Rule: 90% of the computation is done in 10% of the code.


Profiling Tools
Profiling Tools on the Origin
On the SGI Origin2000 computer there are profiling tools named ssrun and prof. Used together they do profiling, or what is called hot spot analysis. They are useful for generating timing profiles.
  - ssrun
    The ssrun utility collects performance data for an executable that you specify. The performance data is written to a file named "executablename.exptype.id".
  - prof
    The prof utility analyzes the data file created by ssrun and produces a report.
  - Example

        ssrun -fpcsamp a.out
        prof -h a.out.fpcsamp.m12345 > prof.list


Profiling Tools
Profiling Tools on the Linux Clusters
On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun, prof, and perfex tools.
  - gprof
    Basic profiling information can be generated using the OS utility gprof.
    First, compile the code with the compiler flags -qp -g for the Intel compiler (-g on the Intel compiler does not change the optimization level) or -pg for the GNU compiler.
    Second, run the program.
    Finally, analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out

        efc -O -qp -g -o foo foo.f
        ./foo
        gprof foo gmon.out


Profiling Tools
Profiling Tools on the Linux Clusters
  - vprof
    On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library.
    To instrument the whole application requires recompiling and linking to the vprof and PAPI libraries.

        setenv VMON PAPI_TOT_CYC
        ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
        ./md
        /usr/apps/tools/vprof/bin/cprof -e md vmon.out


Profile Listings
Profile Listings on the Origin
Prof Output First Listing
The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.

    Cycles    %      Cum %   Secs   Proc
    --------  -----  ------  -----  ------
    42630984  58.47   58.47  0.57   VSUB
     6498294   8.91   67.38  0.09   PFSOR
     6141611   8.42   75.81  0.08   PBSOR
     3654120   5.01   80.82  0.05   PFSOR1
     2615860   3.59   84.41  0.03   VADD
     1580424   2.17   86.57  0.02   ITSRCG
     1144036   1.57   88.14  0.02   ITSRSI
      886044   1.22   89.36  0.01   ITJSI
      861136   1.18   90.54  0.01   ITJCG


Profile Listings
Profile Listings on the Origin
Prof Output Second Listing
The second listing gives the number of cycles per source code line. The lines are listed in descending order of cycle count.

    Cycles    %      Cum %   Line   Proc
    --------  -----  ------  -----  ------
    36556944  50.14   50.14  8106   VSUB
     5313198   7.29   57.43  6974   PFSOR
     4968804   6.81   64.24  6671   PBSOR
     2989882   4.10   68.34  8107   VSUB
     2564544   3.52   71.86  7097   PFSOR1
     1988420   2.73   74.59  8103   VSUB
     1629776   2.24   76.82  8045   VADD
      994210   1.36   78.19  8108   VSUB
      969056   1.33   79.52  8049   VADD
      483018   0.66   80.18  6972   PFSOR


Profile Listings
Profile Listings on the Linux Clusters
gprof Output First Listing
The listing gives a 'flat' profile of functions and routines encountered, sorted by 'self seconds', which is the number of seconds accounted for by this function alone.

    Flat profile:
    Each sample counts as 0.000976562 seconds.
      %    cumulative   self               self       total
     time    seconds   seconds     calls   us/call    us/call   name
    -----  ----------  -------  --------  ---------  ---------  -----------
    38.07       5.67     5.67       101   56157.18  107450.88   compute_
    34.72      10.84     5.17  25199500       0.21       0.21   dist_
    25.48      14.64     3.80                                   SIND_SINCOS
     1.25      14.83     0.19                                   sin
     0.37      14.88     0.06                                   cos
     0.05      14.89     0.01     50500       0.15       0.15   dotr8_
     0.05      14.90     0.01       100      68.36      68.36   update_
     0.01      14.90     0.00                                   f_fioinit
     0.01      14.90     0.00                                   f_intorange
     0.01      14.90     0.00                                   mov
     0.00      14.90     0.00         1       0.00       0.00   initialize_


Profile Listings
Profile Listings on the Linux Clusters
gprof Output Second Listing
The second listing gives a 'call-graph' profile of functions and routines encountered. The definitions of the columns are specific to the line in question. Detailed information is contained in the full output from gprof.

    Call graph:

    index  % time    self  children     called              name
    -----  ------  ------  --------  -----------------    ----------------
    [1]      72.9    0.00     10.86                        main [1]
                     5.67      5.18        101/101             compute_ [2]
                     0.01      0.00        100/100             update_ [8]
                     0.00      0.00          1/1               initialize_ [12]
    -------------------------------------------------------------------------
                     5.67      5.18        101/101          main [1]
    [2]      72.8    5.67      5.18        101              compute_ [2]
                     5.17      0.00  25199500/25199500         dist_ [3]
                     0.01      0.00     50500/50500            dotr8_ [7]
    -------------------------------------------------------------------------
                     5.17      0.00  25199500/25199500      compute_ [2]
    [3]      34.7    5.17      0.00   25199500              dist_ [3]
    -------------------------------------------------------------------------
    [4]      25.5    3.80      0.00                         SIND_SINCOS [4]


Profile Listings
Profile Listings on the Linux Clusters
vprof Listing

    Columns correspond to the following events:
      PAPI_TOT_CYC - Total cycles (1956 events)
    File Summary:
      100.0% /u/ncsa/gbauer/temp/md.f
    Function Summary:
       84.4% compute
       15.6% dist
    Line Summary:
       67.3% /u/ncsa/gbauer/temp/md.f:106
       13.6% /u/ncsa/gbauer/temp/md.f:104
        9.3% /u/ncsa/gbauer/temp/md.f:166
        2.5% /u/ncsa/gbauer/temp/md.f:165
        1.5% /u/ncsa/gbauer/temp/md.f:102
        1.2% /u/ncsa/gbauer/temp/md.f:164
        0.9% /u/ncsa/gbauer/temp/md.f:107
        0.8% /u/ncsa/gbauer/temp/md.f:169
        0.8% /u/ncsa/gbauer/temp/md.f:162
        0.8% /u/ncsa/gbauer/temp/md.f:105

The above listing, generated using the -e option to cprof, displays not only the cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions.


Profiling Analysis
The program being analyzed in the previous Origin example has approximately 10000 source code lines, and consists of many subroutines.
The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine.
The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation.
Going back to the source code, line 8106 is a line inside a do loop. By putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part (a sketch of what that looks like follows).
Since the compiler has rearranged the source lines, the line numbers given by ssrun/prof give you an area of the code to inspect. To view the rearranged source use the options

    f90 -FLIST:=ON
    cc -CLIST:=ON

For the Intel compilers, the appropriate options are ifort -E and icc -E.
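As a hedged sketch (the loop shown is invented; the real loop at line 8106 of VSUB is not reproduced in the slides), parallelizing a hot loop found by the profiler is often as simple as adding one directive pair, provided the iterations are independent:

      !$OMP PARALLEL DO
      do i = 1, n
         c(i) = a(i) - b(i)
      end do
      !$OMP END PARALLEL DO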


Further Information
SGI IRIX
  - man etime
  - man 3 time
  - man 1 time
  - man busage
  - man timers
  - man ssrun
  - man prof
  - Origin2000 Performance Tuning and Optimization Guide
Linux Clusters
  - man 3 clock
  - man 2 gettimeofday
  - man 1 time
  - man 1 gprof
  - man 1B qstat
  - Intel Compilers
  - Vprof on NCSA Linux Cluster


Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690


Agenda
7 Cache Tuning
  7.1 Cache Concepts
    7.1.1 Memory Hierarchy
    7.1.2 Cache Mapping
    7.1.3 Cache Thrashing
    7.1.4 Cache Coherence
  7.2 Cache Specifics
  7.3 Code Optimization
  7.4 Measuring Cache Performance
  7.5 Locating the Cache Problem
  7.6 Cache Tuning Strategy
  7.7 Preserve Spatial Locality
  7.8 Locality Problem
  7.9 Grouping Data Together
  7.10 Cache Thrashing Example
  7.11 Not Enough Cache
  7.12 Loop Blocking
  7.13 Further Information


Cache Concepts
The CPU time required to perform an operation is the sum of the clock cycles executing instructions and the clock cycles waiting for memory.
The CPU cannot be performing useful work if it is waiting for data to arrive from memory.
Clearly then, the memory system is a major factor in determining the performance of your program, and a large part of that is your use of the cache.
The following sections will discuss the key concepts of cache, including:
  - Memory subsystem hierarchy
  - Cache mapping
  - Cache thrashing
  - Cache coherence


Memory Hierarchy
The different subsystems in the memory hierarchy have different speeds, sizes, and costs.
  - Smaller memory is faster
  - Slower memory is cheaper
The