
Page 1: Programming the IBM Power3 SP

Programming the IBM Power3 SP

Eric Aubanel
Advanced Computational Research Laboratory

Faculty of Computer Science, UNB

Page 2: Programming the IBM Power3 SP

Advanced Computational Research Laboratory

• High Performance Computational Problem-Solving and Visualization Environment

• Computational Experiments in multiple disciplines: CS, Science and Eng.

• 16-Processor IBM SP3

• Member of C3.ca Association, Inc. (http://www.c3.ca)

Page 3: Programming the IBM Power3 SP

Advanced Computational Research Laboratory

www.cs.unb.ca/acrl

• Virendra Bhavsar, Director

• Eric Aubanel, Research Associate & Scientific Computing Support

• Sean Seeley, System Administrator

Page 4: Programming the IBM Power3 SP
Page 5: Programming the IBM Power3 SP
Page 6: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 7: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

1990
– Performance Optimized With Enhanced RISC
– Reduced Instruction Set Computer
– Superscalar: combined floating-point multiply-add (FMA) unit, which allowed a peak MFLOPS rate of 2 x MHz
– Initially: 25 MHz (50 MFLOPS) and 64 KB data cache

Page 8: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

1991: SP1
– IBM's first SP (Scalable POWERparallel)
– Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
– Parallel Environment & system software

Page 9: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

1993: POWER2
– 2 FMAs
– Increased data cache size
– 66.5 MHz (254 MFLOPS)
– Improved instruction set (incl. hardware square root)
– SP2: POWER2 + higher-bandwidth switch for larger systems

Page 10: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

1993: PowerPC
– Support for SMP

1996: P2SC
– POWER2 Super Chip: clock speeds up to 160 MHz

Page 11: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

Feb. '99: POWER3
– Combined P2SC & PowerPC
– 64-bit architecture
– Initially 2-way SMP, 200 MHz
– Cache improvements, including L2 cache of 1-16 MB
– Instruction & data prefetch

Page 12: Programming the IBM Power3 SP

POWER3+ chip: Feb. 2000

Winterhawk II - 375 MHz
• 4-way SMP
• 2 MULT/ADD - 1500 MFLOPS per processor
• 64 KB Level 1 - 5 nsec / 3.2 GB/sec
• 8 MB Level 2 - 45 nsec / 6.4 GB/sec
• 1.6 GB/s memory bandwidth
• 6 GFLOPS/node

Nighthawk II - 375 MHz
• 16-way SMP
• 2 MULT/ADD - 1500 MFLOPS per processor
• 64 KB Level 1 - 5 nsec / 3.2 GB/sec
• 8 MB Level 2 - 45 nsec / 6.4 GB/sec
• 14 GB/s memory bandwidth
• 24 GFLOPS/node

Page 13: Programming the IBM Power3 SP

The Clustered SMP

ACRL’s SP: Four 4-way SMPs

Each node has its own copy of the O/S

Processors on the same node are closer to each other than processors on different nodes.

Page 14: Programming the IBM Power3 SP

Power3 Architecture

Page 15: Programming the IBM Power3 SP

Power4 - 32 way

• Logical UMA

• SP High Node

• L3 cache shared between all processors on node - 32 MB

• Up to 32 GB main memory

• Each processor: 1.1 GHz

• 140 Gflops total peak

[Diagram: sixteen 2-processor chips, each pair with private L1 and L2 caches, connected in groups of four by GX buses.]

Page 16: Programming the IBM Power3 SP

Going to NUMA

• 32-way GP High node
• Own copy of AIX
• 128+ GFLOPS/high node
• Multiple Federation adapters for scalable inter-node bandwidth
• NUMA up to 256 processors

[Diagram: SP GP nodes, each running AIX with its own memory, processors and intra-node interconnect, attached through Federation adapters and up to 16 links to a Federation switch.]

NUMA up to 256 processors - 1.1 Teraflops

Page 17: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 18: Programming the IBM Power3 SP

Uni-processor Optimization

• Compiler options:
  – start with -O3 -qstrict, then -O3, -qarch=pwr3

• Cache re-use

• Take advantage of superscalar architecture – give enough operations per load/store

• Use ESSL - optimization already maximally exploited
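To make the cache re-use point concrete, here is a minimal sketch (not from the original slides; the array size is arbitrary). Fortran stores arrays column by column, so keeping the first index in the innermost loop gives stride-1 access and uses every 8-byte word of each 128-byte cache line before the line is evicted:

      program stride
      implicit none
      integer n
      parameter ( n = 1000 )
      real*8 a(n,n), s
      integer i, j
      a = 1.0d0
      s = 0.0d0
      ! good order: the inner loop walks down a column, so successive
      ! iterations touch consecutive memory locations
      do j = 1, n
         do i = 1, n
            s = s + a(i,j)
         end do
      end do
      print *, s
      end

Swapping the two loops would stride through memory by n*8 bytes per iteration, so nearly every access would touch a new cache line (and, for large arrays, a new page).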

Page 19: Programming the IBM Power3 SP

Memory Access Times

              Memory to L2 or L1     L2 to L1          L1 to registers
Width         16 bytes / 2 cycles    32 bytes/cycle    2 x 8 bytes/cycle
Rate          1.6 GB/s               6.4 GB/s          3.2 GB/s
Latency       ~35 cycles             ~6-7 cycles       1 cycle

Page 20: Programming the IBM Power3 SP

Cache

• 128-byte cache line
• L2 cache: 4-way set-associative, 8 MB total (4 x 2 MB)
• L1 cache: 128-way set-associative, 64 KB

Page 21: Programming the IBM Power3 SP

How to Monitor Performance?

• IBM's hardware performance monitor: hpmcount
  – Uses hardware counters on the chip
  – Cache & TLB misses, fp ops, load-stores, …
  – Beta version
  – Available soon on ACRL's SP

Page 22: Programming the IBM Power3 SP

hpmcount sample output

      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 256
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)             : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores               : 0.525 M
Instructions per load/store          : 2.749
Cycles per instruction               : 2.378
Instructions per cycle               : 0.420
Total floating point operations      : 0.066 M
Hardware float point rate            : 2.749 Mflop/sec

Page 23: Programming the IBM Power3 SP

hpmcount sample output

      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 257
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)             : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores               : 0.527 M
Instructions per load/store          : 2.749
Cycles per instruction               : 1.271
Instructions per cycle               : 0.787
Total floating point operations      : 0.066 M
Hardware float point rate            : 3.525 Mflop/sec

The only change is the padded leading dimension (257 instead of 256): power-of-two dimensions make the pages of a, b and c that are in use at the same time map to the same sets of the set-associative TLB, so padding sharply reduces TLB misses and improves the flop rate.

Page 24: Programming the IBM Power3 SP

ESSL

• Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers

• Fast! 560x560 real*8 matrix multiply:
  – Hand coding: 19 Mflops
  – dgemm: 1.2 GFlops

• Parallel (threaded and distributed) versions
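For reference, a minimal sketch of the call behind the dgemm figure above (assuming the standard BLAS-style DGEMM argument order, which ESSL's DGEMM follows; link with -lessl):

      program mm
      implicit none
      integer n
      parameter ( n = 560 )
      real*8 a(n,n), b(n,n), c(n,n)
      a = 1.0d0
      b = 2.0d0
      c = 0.0d0
      ! c = 1.0*a*b + 0.0*c; all matrices are n x n with leading dimension n
      call dgemm ( 'N', 'N', n, n, n, 1.0d0, a, n,
     &             b, n, 0.0d0, c, n )
      print *, c(1,1)
      end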

Page 25: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 26: Programming the IBM Power3 SP

ACRL’s IBM SP

• 4 Winterhawk II nodes (16 processors)

• Each node has:
  – 1 GB RAM
  – 9 GB (mirrored) disk
  – Switch adapter

• High Performance Switch

• Gigabit Ethernet (1 node)

• Control workstation

• Disk: SSA tower with six 18.2 GB disks

Page 27: Programming the IBM Power3 SP
Page 28: Programming the IBM Power3 SP

IBM Power3 SP Switch

• Bidirectional multistage interconnection network (MIN)

• 300 MB/sec bi-directional

• 1.2 μsec latency

Page 29: Programming the IBM Power3 SP

General Parallel File System

[Diagram: Nodes 2, 3 and 4 each run an application over a GPFS client and RVSD/VSD; Node 1 runs an application over the GPFS server and RVSD/VSD; all nodes are connected by the SP Switch.]

Page 30: Programming the IBM Power3 SP

ACRL Software

• Operating System: AIX 4.3.3

• Compilers
  – IBM XL Fortran 7.1 (HPF not yet installed)
  – VisualAge C for AIX, Version 5.0.1.0
  – VisualAge C++ Professional for AIX, Version 5.0.0.0
  – IBM VisualAge Java - not yet installed

• Job Scheduler: LoadLeveler 2.2

• Parallel Programming Tools
  – IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O

• Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2)

• Visualization: OpenDX (not yet installed)

• E-Commerce software (not yet installed)

Page 31: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 32: Programming the IBM Power3 SP

Why Parallel Computing?

• Solve large problems in reasonable time

• Many algorithms are inherently parallel
  – image processing, Monte Carlo
  – simulations (e.g. CFD)

• High performance computers have parallel architectures
  – Commercial off-the-shelf (COTS) components
    • Beowulf clusters
    • SMP nodes
  – Improvements in network technology

Page 33: Programming the IBM Power3 SP

[Performance results: NRL Layered Ocean Model at the Naval Research Laboratory, run on an IBM Winterhawk II SP.]

Page 34: Programming the IBM Power3 SP

Parallel Computational Models

• Data Parallelism
  – Parallel program looks like a serial program; the parallelism is in the data
  – Vector processors
  – HPF

Page 35: Programming the IBM Power3 SP

Parallel Computational Models

• Message Passing (MPI)
  – Processes have only local memory but can communicate with other processes by sending & receiving messages
  – Data transfer between processes requires operations to be performed by both processes
  – The communication network (hypercube, torus, …) is not part of the computational model

Page 36: Programming the IBM Power3 SP

Parallel Computational Models

• Shared Memory (threads)
  – POSIX threads (Pthreads)
  – OpenMP: higher-level standard

Page 37: Programming the IBM Power3 SP

Parallel Computational Models

• Remote Memory Operations
  – "One-sided" communication (Put/Get)
    • MPI-2, IBM's LAPI
  – One process can access the memory of another without the other's participation, but does so explicitly, not in the same way it accesses local memory
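As an illustration (a sketch, not from the slides), an MPI-2 one-sided exchange between two processes exposes an array as a window and transfers data with MPI_PUT between synchronizing fences; the target process never issues a matching receive:

      program onesided
      implicit none
      include "mpif.h"
      integer n
      parameter ( n = 100 )
      real*8 a(n), b(n)
      integer my_id, other_id, win, ierr
      integer (kind=MPI_ADDRESS_KIND) winsize, disp

      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      other_id = mod ( my_id + 1, 2 )
      a = my_id
      b = 0.0d0

      ! expose b as a window that other processes may write into
      winsize = n * 8
      call MPI_WIN_CREATE ( b, winsize, 8, MPI_INFO_NULL,
     &                      MPI_COMM_WORLD, win, ierr )

      call MPI_WIN_FENCE ( 0, win, ierr )
      ! put the local array a into the other process's window (its b)
      disp = 0
      call MPI_PUT ( a, n, MPI_DOUBLE_PRECISION, other_id, disp,
     &               n, MPI_DOUBLE_PRECISION, win, ierr )
      call MPI_WIN_FENCE ( 0, win, ierr )

      call MPI_WIN_FREE ( win, ierr )
      call MPI_FINALIZE ( ierr )
      end

Run with exactly two MPI processes; the fences delimit the access epoch, so no receive is needed on the target side.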

Page 38: Programming the IBM Power3 SP

Parallel Computational Models

• Combined: Message Passing & Threads
  – Driven by clusters of SMPs
  – Leads to software complexity!

[Diagram: several nodes, each with multiple processes sharing an address space, connected by a network.]

Page 39: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 40: Programming the IBM Power3 SP

Message Passing Interface

• MPI 1.0 standard in 1994

• MPI 1.1 in 1995 - IBM support

• MPI 2.0 in 1997
  – Includes 1.1 but adds new features:
    • MPI-IO
    • One-sided communication
    • Dynamic processes

Page 41: Programming the IBM Power3 SP

Advantages of MPI

• Universality

• Expressivity
  – Well suited to formulating a parallel algorithm

• Ease of debugging
  – Memory is local

• Performance
  – Explicit association of data with a process allows good use of cache

Page 42: Programming the IBM Power3 SP

MPI Functionality

• Several modes of point-to-point message passing
  – blocking (e.g. MPI_SEND)
  – non-blocking (e.g. MPI_ISEND)
  – synchronous (e.g. MPI_SSEND)
  – buffered (e.g. MPI_BSEND)

• Collective communication and synchronization
  – e.g. MPI_REDUCE, MPI_BARRIER

• User-defined datatypes

• Logically distinct communicator spaces

• Application-level or virtual topologies
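As a small, hedged illustration of the non-blocking and collective items above (not code from the talk), two processes can exchange arrays with MPI_ISEND/MPI_IRECV and then combine results with MPI_REDUCE:

      program nonblock
      implicit none
      include "mpif.h"
      integer n
      parameter ( n = 100 )
      real*8 a(n), b(n), partial, total
      integer my_id, other_id, ierr, i
      integer req(2), stats(MPI_STATUS_SIZE,2)

      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      other_id = mod ( my_id + 1, 2 )
      a = my_id

      ! non-blocking send and receive: both are posted immediately,
      ! so the exchange cannot deadlock regardless of message size
      call MPI_ISEND ( a, n, MPI_DOUBLE_PRECISION, other_id, 0,
     &                 MPI_COMM_WORLD, req(1), ierr )
      call MPI_IRECV ( b, n, MPI_DOUBLE_PRECISION, other_id, 0,
     &                 MPI_COMM_WORLD, req(2), ierr )
      call MPI_WAITALL ( 2, req, stats, ierr )

      ! collective: sum each process's partial result onto rank 0
      partial = 0.0d0
      do i = 1, n
         partial = partial + b(i)
      end do
      call MPI_REDUCE ( partial, total, 1, MPI_DOUBLE_PRECISION,
     &                  MPI_SUM, 0, MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' total = ', total

      call MPI_FINALIZE ( ierr )
      end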

Page 43: Programming the IBM Power3 SP

Simple MPI Example

Two processes, My_Id = 0 and 1. Expected output:

  This is from MPI process number 0
  This is from MPI processes other than 0

Page 44: Programming the IBM Power3 SP

Simple MPI Example

      Program Trivial
      implicit none
      include "mpif.h"     ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr

      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, Ierr )

      print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs

      if ( My_Id .eq. 0 ) then
         print *, ' This is from MPI process number ', My_Id
      else
         print *, ' This is from MPI processes other than 0 ', My_Id
      end if

      call MPI_FINALIZE ( Ierr )   ! bad things happen if you forget Ierr
      stop
      end

Page 45: Programming the IBM Power3 SP

MPI Example with send/recv

Two processes, My_Id = 0 and 1; each process sends an array to the other and receives one back.

Page 46: Programming the IBM Power3 SP

MPI Example with send/recv

      Program Simple
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr
      integer Status ( MPI_STATUS_SIZE )
      parameter ( Nx = 100 )
      real A ( Nx ), B ( Nx )

      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )

      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id

      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id,
     &                MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id,
     &                MPI_COMM_WORLD, Status, Ierr )

      call MPI_FINALIZE ( Ierr )
      stop
      end

Page 47: Programming the IBM Power3 SP

What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD,
         &status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD,
         &status);

Page 48: Programming the IBM Power3 SP

MPI Message Passing Modes

Mode           Protocol used
Ready          Ready
Standard       Eager (message <= eager limit) or Rendezvous (message > eager limit)
Synchronous    Rendezvous
Buffered       Buffered

Default eager limit on the SP is 4 KB (can be raised to 64 KB).
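One way to make the exchange on the "What Will Happen?" slide safe regardless of the eager limit is MPI_SENDRECV, which pairs the send and receive inside a single call. A sketch, re-using the variables of Program Simple above:

      ! exchange A for B with the partner in one call; safe even when
      ! the message is larger than the eager limit
      call MPI_SENDRECV ( A, Nx, MPI_REAL, Other_Id, 0,
     &                    B, Nx, MPI_REAL, Other_Id, 0,
     &                    MPI_COMM_WORLD, Status, Ierr )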

Page 49: Programming the IBM Power3 SP

MPI Performance Visualization

• ParaGraph
  – Developed by the University of Illinois
  – Graphical display system for visualizing the behaviour and performance of MPI programs

Page 50: Programming the IBM Power3 SP
Page 51: Programming the IBM Power3 SP
Page 52: Programming the IBM Power3 SP

Message Passing on SMP

[Diagram: MPI_SEND on one processor copies the data to send into a buffer; the data crosses the memory crossbar/switch into the receiving processor's buffer, where MPI_RECEIVE delivers it.]

export MP_SHARED_MEMORY=yes|no

Page 53: Programming the IBM Power3 SP

Shared Memory MPI

MP_SHARED_MEMORY=<yes|no>

                                   Latency (μsec)   Bandwidth (MB/sec)
Between 2 nodes                         24               133
Same node, MP_SHARED_MEMORY=no          30                80
Same node, MP_SHARED_MEMORY=yes         10               270

Page 54: Programming the IBM Power3 SP

Message Passing off Node

MPI Across all the processors

Many more messages going through the fabric

Page 55: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 56: Programming the IBM Power3 SP

OpenMP

• 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms.

• www.openmp.org

• OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.

Page 57: Programming the IBM Power3 SP

OpenMP

• All processors can access all the memory in the parallel system

• Parallel execution is achieved by generating threads which execute in parallel

• Overhead for SMP parallelization is large (100-200 μsec): the parallel work construct must be big enough to overcome this overhead

Page 58: Programming the IBM Power3 SP

OpenMP

1. All OpenMP programs begin as a single process: the master thread

2. FORK: the master thread creates a team of parallel threads

3. Parallel region statements are executed in parallel by the team threads

4. JOIN: threads synchronize and terminate, leaving only the master thread
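A minimal fork/join sketch (not from the slides; with IBM XL Fortran such code is typically compiled with xlf_r and -qsmp=omp): the master thread forks a team at the PARALLEL directive, every thread runs the region, and the team joins at END PARALLEL:

      program forkjoin
      implicit none
      integer omp_get_thread_num, omp_get_num_threads
      integer tid, nthreads

      print *, 'master thread before the parallel region'
!$OMP PARALLEL PRIVATE(tid, nthreads)
      ! FORK: this block is executed by every thread in the team
      tid = omp_get_thread_num()
      nthreads = omp_get_num_threads()
      print *, 'hello from thread ', tid, ' of ', nthreads
!$OMP END PARALLEL
      ! JOIN: only the master thread continues from here
      print *, 'master thread after the parallel region'
      end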

Page 59: Programming the IBM Power3 SP

OpenMP

How is OpenMP typically used?

• OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops.
  – Split them up between threads.

• Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!

Page 60: Programming the IBM Power3 SP

OpenMP Loop Parallelization

!$OMP PARALLEL DO
      do i = 0, ilong
         do k = 1, kshort
            ...
         end do
      end do

#pragma omp parallel for
for (i = 0; i <= ilong; i++)
    for (k = 1; k <= kshort; k++) {
        ...
    }

Page 61: Programming the IBM Power3 SP

Variable Scoping

• Most difficult part of shared memory parallelization:
  – What memory is shared
  – What memory is private (each processor has its own copy)

• Compare MPI: all variables are private

• Variables are shared by default, except:
  – loop indices
  – scalars that are set and then used in the loop
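A small sketch of explicit scoping clauses (variable names are illustrative only, not from the slides): the loop index and a scalar that is set and then used must be private, while the arrays stay shared:

!$OMP PARALLEL DO PRIVATE(i, tmp) SHARED(a, b, n)
      do i = 1, n
         tmp = 2.0d0 * b(i)     ! set then used: must be private
         a(i) = tmp + b(i)      ! a and b are safely shared (disjoint i)
      end do
!$OMP END PARALLEL DO

Declaring the scoping explicitly avoids relying on the default rules.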

Page 62: Programming the IBM Power3 SP

How Does Sharing Work?

THREAD 1:                        THREAD 2:
increment(x)                     increment(x)
{                                {
    x = x + 1;                       x = x + 1;
}                                }

THREAD 1:                        THREAD 2:
10 LOAD  A, (x address)          10 LOAD  A, (x address)
20 ADD   A, 1                    20 ADD   A, 1
30 STORE A, (x address)          30 STORE A, (x address)

Shared x is initially 0. Depending on how the two threads' loads and stores interleave, the result could be 1 or 2 - synchronization is needed.
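The race above can be removed with an OpenMP synchronization directive. A minimal sketch (not from the slides) using ATOMIC, which makes each thread's read-modify-write of x indivisible:

      program race
      implicit none
      integer x
      x = 0
!$OMP PARALLEL
!$OMP ATOMIC
      x = x + 1            ! each thread's update is now indivisible
!$OMP END PARALLEL
      print *, 'x = ', x   ! equals the number of threads
      end

A REDUCTION clause on a parallel loop achieves the same effect for accumulations without naming the synchronization explicitly.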

Page 63: Programming the IBM Power3 SP

False Sharing

[Diagram: a single cache line holding array elements 0-7, with Processor 1 and Processor 2 each caching a copy of the block (address tag + data).]

!$OMP PARALLEL DO
      do I = 1, 20
         A(I) = ...
      end do

Say A(1:5), assigned to the first thread, starts on a cache line; then some of A(6:10), assigned to the second thread, falls on that same line. Both threads write to the one line, so it bounces between the two processors' caches and the second thread's updates stall until the first thread gives the line up.
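One common mitigation (a sketch, not from the slides) is to give each thread chunks that span whole cache lines; with 128-byte lines and real*8 data, a chunk of 16 iterations means two threads rarely write into the same line (assuming the array start is reasonably aligned):

!$OMP PARALLEL DO SCHEDULE(STATIC, 16)
      do I = 1, N
         A(I) = 2.0d0 * A(I)
      end do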

Page 64: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 65: Programming the IBM Power3 SP

Why Hybrid MPI-OpenMP?

• To optimize performance on “mixed-mode” hardware like the SP

• MPI is used for "inter-node" communication, and OpenMP is used for "intra-node" communication
  – threads have lower latency
  – threads can alleviate the network contention of a pure MPI implementation

Page 66: Programming the IBM Power3 SP

Hybrid MPI-OpenMP?

• Unless you are forced against your will, for the hybrid model to be worthwhile:
  – There has to be obvious parallelism to exploit
  – The code has to be easy to program and maintain (it is easy to write bad OpenMP code)
  – It has to promise to perform at least as well as the equivalent all-MPI program

• Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
  – especially true of applications with a single level of parallelism

Page 67: Programming the IBM Power3 SP

Hybrid Scenario

• Thread the computational portions of the code that exist between MPI calls

• MPI calls are "single-threaded" and therefore use only a single CPU

• Assumes:
  – the application has two natural levels of parallelism
  – or that, in breaking up an MPI code with one level of parallelism, there is little or no communication between the resulting threads

A minimal skeleton of this pattern is sketched below.
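The skeleton (a sketch under these assumptions, not the talk's code): one MPI process per node, OpenMP threads over the computation between MPI calls, and MPI called only from the single master thread:

      program hybrid
      implicit none
      include "mpif.h"
      integer n
      parameter ( n = 100000 )
      real*8 a(n), partial, total
      integer my_id, ierr, i

      call MPI_INIT ( ierr )                 ! single-threaded MPI
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      a = my_id + 1.0d0

      ! computational section between MPI calls is threaded with OpenMP
      partial = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:partial)
      do i = 1, n
         partial = partial + a(i) * a(i)
      end do

      ! back to a single thread for the MPI call
      call MPI_REDUCE ( partial, total, 1, MPI_DOUBLE_PRECISION,
     &                  MPI_SUM, 0, MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' total = ', total
      call MPI_FINALIZE ( ierr )
      end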

Page 68: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 69: Programming the IBM Power3 SP

MPI-IO

• Part of MPI-2

• Resulted from work at IBM Research exploring the analogy between I/O and message passing

• See "Using MPI-2" by Gropp et al. (MIT Press)

[Diagram: processes, their data in memory, and a shared file.]
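A minimal sketch of the idea (file name and layout are illustrative only, not from the talk): each process writes its own block of one shared file at an offset computed from its rank:

      program mpiio
      implicit none
      include "mpif.h"
      integer n
      parameter ( n = 100 )
      real*8 a(n)
      integer my_id, fh, ierr
      integer (kind=MPI_OFFSET_KIND) offset

      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      a = my_id

      ! all processes open the same file
      call MPI_FILE_OPEN ( MPI_COMM_WORLD, 'out.dat',
     &                     MPI_MODE_WRONLY + MPI_MODE_CREATE,
     &                     MPI_INFO_NULL, fh, ierr )

      ! each process writes its block at a rank-dependent byte offset
      offset = my_id * n * 8
      call MPI_FILE_WRITE_AT ( fh, offset, a, n,
     &                         MPI_DOUBLE_PRECISION,
     &                         MPI_STATUS_IGNORE, ierr )

      call MPI_FILE_CLOSE ( fh, ierr )
      call MPI_FINALIZE ( ierr )
      end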

Page 70: Programming the IBM Power3 SP

Conclusion

• Don't forget uni-processor optimization

• If you choose one parallel programming API, choose MPI

• Mixed MPI-OpenMP may be appropriate in certain cases
  – More work is needed here

• The remote memory access model may be the answer