
Page 1: Programming the IBM Power3 SP

Programming the IBM Power3 SP

Eric Aubanel
Advanced Computational Research Laboratory

Faculty of Computer Science, UNB

Page 2: Programming the IBM Power3 SP

Advanced Computational Research Laboratory

• High Performance Computational Problem-Solving and Visualization Environment

• Computational Experiments in multiple disciplines: CS, Science and Eng.

• 16-Processor IBM SP3

• Member of C3.ca Association, Inc. (http://www.c3.ca)

Page 3: Programming the IBM Power3 SP

Advanced Computational Research Laboratory

www.cs.unb.ca/acrl

• Virendra Bhavsar, Director

• Eric Aubanel, Research Associate & Scientific Computing Support

• Sean Seeley, System Administrator

Page 4: Programming the IBM Power3 SP
Page 5: Programming the IBM Power3 SP
Page 6: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 7: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

1990
– Performance Optimized With Enhanced RISC
– Reduced Instruction Set Computer
– Superscalar: combined floating-point multiply-add (FMA) unit, which allowed a peak MFLOPS rate of 2 x MHz
– Initially: 25 MHz (50 MFLOPS) and 64 KB data cache

Page 8: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

1991: SP1
– IBM's first SP (Scalable POWERparallel)
– Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
– Parallel Environment & system software

Page 9: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

1993: POWER2
– 2 FMAs
– Increased data cache size
– 66.5 MHz (254 MFLOPS)
– Improved instruction set (incl. hardware square root)
– SP2: POWER2 + higher-bandwidth switch for larger systems

Page 10: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

1993: PowerPC
– Support for SMP

1996: P2SC
– POWER2 Super Chip: clock speeds up to 160 MHz

Page 11: Programming the IBM Power3 SP

POWER chip: 1990 to 2003

Feb. '99: POWER3
– Combined P2SC & PowerPC
– 64-bit architecture
– Initially 2-way SMP, 200 MHz
– Cache improvements, including L2 cache of 1-16 MB
– Instruction & data prefetch

Page 12: Programming the IBM Power3 SP

POWER3+ chip: Feb. 2000

Winterhawk II - 375 MHz
• 4-way SMP
• 2 MULT/ADD - 1500 MFLOPS per processor
• 64 KB Level 1 - 5 nsec / 3.2 GB/sec
• 8 MB Level 2 - 45 nsec / 6.4 GB/sec
• 1.6 GB/s memory bandwidth
• 6 GFLOPS/node

Nighthawk II - 375 MHz
• 16-way SMP
• 2 MULT/ADD - 1500 MFLOPS per processor
• 64 KB Level 1 - 5 nsec / 3.2 GB/sec
• 8 MB Level 2 - 45 nsec / 6.4 GB/sec
• 14 GB/s memory bandwidth
• 24 GFLOPS/node

Page 13: Programming the IBM Power3 SP

The Clustered SMP

ACRL’s SP: Four 4-way SMPs

Each node has its own copy of the O/S

Processors on the same node are closer to each other than processors on different nodes.

Page 14: Programming the IBM Power3 SP

Power3 Architecture

Page 15: Programming the IBM Power3 SP

Power4 - 32 way

• Logical UMA

• SP High Node

• L3 cache shared between all processors on node - 32 MB

• Up to 32 GB main memory

• Each processor: 1.1 GHz

• 140 Gflops total peak

[Diagram: sixteen 2-processor chips, each pair with private L1 and L2 caches, connected in groups of four by GX buses.]

Page 16: Programming the IBM Power3 SP

Going to NUMA

• 32-way GP High node
• Own copy of AIX
• 128+ GFLOPS/high node
• Multiple Federation adapters for scalable inter-node bandwidth
• NUMA up to 256 processors

[Diagram: SP GP nodes, each running AIX with its own memory, processors and intra-node interconnect, attached through Federation adapters and up to 16 links to a Federation switch.]

NUMA up to 256 processors - 1.1 Teraflops

Page 17: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 18: Programming the IBM Power3 SP

Uni-processor Optimization

• Compiler options:
  – start with -O3 -qstrict, then -O3, -qarch=pwr3

• Cache re-use

• Take advantage of superscalar architecture – give enough operations per load/store

• Use ESSL - optimization already maximally exploited
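To make the cache re-use point concrete, here is a minimal sketch (not from the original slides; the array size is arbitrary). Fortran stores arrays column by column, so keeping the first index in the innermost loop gives stride-1 access and uses every 8-byte word of each 128-byte cache line before the line is evicted:

      program stride
      implicit none
      integer n
      parameter ( n = 1000 )
      real*8 a(n,n), s
      integer i, j
      a = 1.0d0
      s = 0.0d0
      ! good order: the inner loop walks down a column, so successive
      ! iterations touch consecutive memory locations
      do j = 1, n
         do i = 1, n
            s = s + a(i,j)
         end do
      end do
      print *, s
      end

Swapping the two loops would stride through memory by n*8 bytes per iteration, so nearly every access would touch a new cache line (and, for large arrays, a new page).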

Page 19: Programming the IBM Power3 SP

Memory Access Times

              Memory to L2 or L1     L2 to L1          L1 to registers
Width         16 bytes / 2 cycles    32 bytes/cycle    2 x 8 bytes/cycle
Rate          1.6 GB/s               6.4 GB/s          3.2 GB/s
Latency       ~35 cycles             ~6-7 cycles       1 cycle

Page 20: Programming the IBM Power3 SP

Cache

• 128-byte cache line
• L2 cache: 4-way set-associative, 8 MB total (4 x 2 MB)
• L1 cache: 128-way set-associative, 64 KB

Page 21: Programming the IBM Power3 SP

How to Monitor Performance?

• IBM's hardware performance monitor: hpmcount
  – Uses hardware counters on the chip
  – Cache & TLB misses, fp ops, load-stores, …
  – Beta version
  – Available soon on ACRL's SP

Page 22: Programming the IBM Power3 SP

hpmcount sample output

      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 256
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)             : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores               : 0.525 M
Instructions per load/store          : 2.749
Cycles per instruction               : 2.378
Instructions per cycle               : 0.420
Total floating point operations      : 0.066 M
Hardware float point rate            : 2.749 Mflop/sec

Page 23: Programming the IBM Power3 SP

hpmcount sample output

      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 257
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)             : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores               : 0.527 M
Instructions per load/store          : 2.749
Cycles per instruction               : 1.271
Instructions per cycle               : 0.787
Total floating point operations      : 0.066 M
Hardware float point rate            : 3.525 Mflop/sec

The only change is the padded leading dimension (257 instead of 256): power-of-two dimensions make the pages of a, b and c that are in use at the same time map to the same sets of the set-associative TLB, so padding sharply reduces TLB misses and improves the flop rate.

Page 24: Programming the IBM Power3 SP

ESSL

• Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers

• Fast! 560x560 real*8 matrix multiply:
  – Hand coding: 19 Mflops
  – dgemm: 1.2 GFlops

• Parallel (threaded and distributed) versions
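For reference, a minimal sketch of the call behind the dgemm figure above (assuming the standard BLAS-style DGEMM argument order, which ESSL's DGEMM follows; link with -lessl):

      program mm
      implicit none
      integer n
      parameter ( n = 560 )
      real*8 a(n,n), b(n,n), c(n,n)
      a = 1.0d0
      b = 2.0d0
      c = 0.0d0
      ! c = 1.0*a*b + 0.0*c; all matrices are n x n with leading dimension n
      call dgemm ( 'N', 'N', n, n, n, 1.0d0, a, n,
     &             b, n, 0.0d0, c, n )
      print *, c(1,1)
      end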

Page 25: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 26: Programming the IBM Power3 SP

ACRL’s IBM SP

• 4 Winterhawk II nodes (16 processors)

• Each node has:
  – 1 GB RAM
  – 9 GB (mirrored) disk
  – Switch adapter

• High Performance Switch

• Gigabit Ethernet (1 node)

• Control workstation

• Disk: SSA tower with six 18.2 GB disks

Page 27: Programming the IBM Power3 SP
Page 28: Programming the IBM Power3 SP

IBM Power3 SP Switch

• Bidirectional multistage interconnection network (MIN)

• 300 MB/sec bi-directional

• 1.2 μsec latency

Page 29: Programming the IBM Power3 SP

General Parallel File System

[Diagram: Nodes 2, 3 and 4 each run an application over a GPFS client and RVSD/VSD; Node 1 runs an application over the GPFS server and RVSD/VSD; all nodes are connected by the SP Switch.]

Page 30: Programming the IBM Power3 SP

ACRL Software

• Operating System: AIX 4.3.3

• Compilers
  – IBM XL Fortran 7.1 (HPF not yet installed)
  – VisualAge C for AIX, Version 5.0.1.0
  – VisualAge C++ Professional for AIX, Version 5.0.0.0
  – IBM VisualAge Java - not yet installed

• Job Scheduler: LoadLeveler 2.2

• Parallel Programming Tools
  – IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O

• Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2)

• Visualization: OpenDX (not yet installed)

• E-Commerce software (not yet installed)

Page 31: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 32: Programming the IBM Power3 SP

Why Parallel Computing?

• Solve large problems in reasonable time

• Many algorithms are inherently parallel
  – image processing, Monte Carlo
  – simulations (e.g. CFD)

• High performance computers have parallel architectures
  – Commercial off-the-shelf (COTS) components
    • Beowulf clusters
    • SMP nodes
  – Improvements in network technology

Page 33: Programming the IBM Power3 SP

[Performance results: NRL Layered Ocean Model at the Naval Research Laboratory, run on an IBM Winterhawk II SP.]

Page 34: Programming the IBM Power3 SP

Parallel Computational Models

• Data Parallelism
  – Parallel program looks like a serial program; the parallelism is in the data
  – Vector processors
  – HPF

Page 35: Programming the IBM Power3 SP

Parallel Computational Models

• Message Passing (MPI)
  – Processes have only local memory but can communicate with other processes by sending & receiving messages
  – Data transfer between processes requires operations to be performed by both processes
  – The communication network (hypercube, torus, …) is not part of the computational model

Page 36: Programming the IBM Power3 SP

Parallel Computational Models

• Shared Memory (threads)
  – POSIX threads (Pthreads)
  – OpenMP: higher-level standard

Page 37: Programming the IBM Power3 SP

Parallel Computational Models

• Remote Memory Operations
  – "One-sided" communication (Put/Get)
    • MPI-2, IBM's LAPI
  – One process can access the memory of another without the other's participation, but does so explicitly, not in the same way it accesses local memory
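As an illustration (a sketch, not from the slides), an MPI-2 one-sided exchange between two processes exposes an array as a window and transfers data with MPI_PUT between synchronizing fences; the target process never issues a matching receive:

      program onesided
      implicit none
      include "mpif.h"
      integer n
      parameter ( n = 100 )
      real*8 a(n), b(n)
      integer my_id, other_id, win, ierr
      integer (kind=MPI_ADDRESS_KIND) winsize, disp

      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      other_id = mod ( my_id + 1, 2 )
      a = my_id
      b = 0.0d0

      ! expose b as a window that other processes may write into
      winsize = n * 8
      call MPI_WIN_CREATE ( b, winsize, 8, MPI_INFO_NULL,
     &                      MPI_COMM_WORLD, win, ierr )

      call MPI_WIN_FENCE ( 0, win, ierr )
      ! put the local array a into the other process's window (its b)
      disp = 0
      call MPI_PUT ( a, n, MPI_DOUBLE_PRECISION, other_id, disp,
     &               n, MPI_DOUBLE_PRECISION, win, ierr )
      call MPI_WIN_FENCE ( 0, win, ierr )

      call MPI_WIN_FREE ( win, ierr )
      call MPI_FINALIZE ( ierr )
      end

Run with exactly two MPI processes; the fences delimit the access epoch, so no receive is needed on the target side.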

Page 38: Programming the IBM Power3 SP

Parallel Computational Models

• Combined: Message Passing & Threads
  – Driven by clusters of SMPs
  – Leads to software complexity!

[Diagram: several nodes, each with multiple processes sharing an address space, connected by a network.]

Page 39: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 40: Programming the IBM Power3 SP

Message Passing Interface

• MPI 1.0 standard in 1994

• MPI 1.1 in 1995 - IBM support

• MPI 2.0 in 1997
  – Includes 1.1 but adds new features:
    • MPI-IO
    • One-sided communication
    • Dynamic processes

Page 41: Programming the IBM Power3 SP

Advantages of MPI

• Universality

• Expressivity
  – Well suited to formulating a parallel algorithm

• Ease of debugging
  – Memory is local

• Performance
  – Explicit association of data with a process allows good use of cache

Page 42: Programming the IBM Power3 SP

MPI Functionality

• Several modes of point-to-point message passing
  – blocking (e.g. MPI_SEND)
  – non-blocking (e.g. MPI_ISEND)
  – synchronous (e.g. MPI_SSEND)
  – buffered (e.g. MPI_BSEND)

• Collective communication and synchronization
  – e.g. MPI_REDUCE, MPI_BARRIER

• User-defined datatypes

• Logically distinct communicator spaces

• Application-level or virtual topologies
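As a small, hedged illustration of the non-blocking and collective items above (not code from the talk), two processes can exchange arrays with MPI_ISEND/MPI_IRECV and then combine results with MPI_REDUCE:

      program nonblock
      implicit none
      include "mpif.h"
      integer n
      parameter ( n = 100 )
      real*8 a(n), b(n), partial, total
      integer my_id, other_id, ierr, i
      integer req(2), stats(MPI_STATUS_SIZE,2)

      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      other_id = mod ( my_id + 1, 2 )
      a = my_id

      ! non-blocking send and receive: both are posted immediately,
      ! so the exchange cannot deadlock regardless of message size
      call MPI_ISEND ( a, n, MPI_DOUBLE_PRECISION, other_id, 0,
     &                 MPI_COMM_WORLD, req(1), ierr )
      call MPI_IRECV ( b, n, MPI_DOUBLE_PRECISION, other_id, 0,
     &                 MPI_COMM_WORLD, req(2), ierr )
      call MPI_WAITALL ( 2, req, stats, ierr )

      ! collective: sum each process's partial result onto rank 0
      partial = 0.0d0
      do i = 1, n
         partial = partial + b(i)
      end do
      call MPI_REDUCE ( partial, total, 1, MPI_DOUBLE_PRECISION,
     &                  MPI_SUM, 0, MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' total = ', total

      call MPI_FINALIZE ( ierr )
      end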

Page 43: Programming the IBM Power3 SP

Simple MPI Example

Two processes, My_Id = 0 and 1. Expected output:

  This is from MPI process number 0
  This is from MPI processes other than 0

Page 44: Programming the IBM Power3 SP

Simple MPI Example

      Program Trivial
      implicit none
      include "mpif.h"     ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr

      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, Ierr )

      print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs

      if ( My_Id .eq. 0 ) then
         print *, ' This is from MPI process number ', My_Id
      else
         print *, ' This is from MPI processes other than 0 ', My_Id
      end if

      call MPI_FINALIZE ( Ierr )   ! bad things happen if you forget Ierr
      stop
      end

Page 45: Programming the IBM Power3 SP

MPI Example with send/recv

Two processes, My_Id = 0 and 1; each process sends an array to the other and receives one back.

Page 46: Programming the IBM Power3 SP

MPI Example with send/recv

      Program Simple
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr
      integer Status ( MPI_STATUS_SIZE )
      parameter ( Nx = 100 )
      real A ( Nx ), B ( Nx )

      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )

      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id

      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id,
     &                MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id,
     &                MPI_COMM_WORLD, Status, Ierr )

      call MPI_FINALIZE ( Ierr )
      stop
      end

Page 47: Programming the IBM Power3 SP

What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD,
         &status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD,
         &status);

Page 48: Programming the IBM Power3 SP

MPI Message Passing Modes

Mode           Protocol used
Ready          Ready
Standard       Eager (message <= eager limit) or Rendezvous (message > eager limit)
Synchronous    Rendezvous
Buffered       Buffered

Default eager limit on the SP is 4 KB (can be raised to 64 KB).
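One way to make the exchange on the "What Will Happen?" slide safe regardless of the eager limit is MPI_SENDRECV, which pairs the send and receive inside a single call. A sketch, re-using the variables of Program Simple above:

      ! exchange A for B with the partner in one call; safe even when
      ! the message is larger than the eager limit
      call MPI_SENDRECV ( A, Nx, MPI_REAL, Other_Id, 0,
     &                    B, Nx, MPI_REAL, Other_Id, 0,
     &                    MPI_COMM_WORLD, Status, Ierr )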

Page 49: Programming the IBM Power3 SP

MPI Performance Visualization

• ParaGraph
  – Developed by the University of Illinois
  – Graphical display system for visualizing the behaviour and performance of MPI programs

Page 50: Programming the IBM Power3 SP
Page 51: Programming the IBM Power3 SP
Page 52: Programming the IBM Power3 SP

Message Passing on SMP

[Diagram: MPI_SEND on one processor copies the data to send into a buffer; the data crosses the memory crossbar/switch into the receiving processor's buffer, where MPI_RECEIVE delivers it.]

export MP_SHARED_MEMORY=yes|no

Page 53: Programming the IBM Power3 SP

Shared Memory MPI

MP_SHARED_MEMORY=<yes|no>

                                   Latency (μsec)   Bandwidth (MB/sec)
Between 2 nodes                         24               133
Same node, MP_SHARED_MEMORY=no          30                80
Same node, MP_SHARED_MEMORY=yes         10               270

Page 54: Programming the IBM Power3 SP

Message Passing off Node

MPI Across all the processors

Many more messages going through the fabric

Page 55: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 56: Programming the IBM Power3 SP

OpenMP

• 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms.

• www.openmp.org

• OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.

Page 57: Programming the IBM Power3 SP

OpenMP

• All processors can access all the memory in the parallel system

• Parallel execution is achieved by generating threads which execute in parallel

• Overhead for SMP parallelization is large (100-200 μsec): the parallel work construct must be big enough to overcome this overhead

Page 58: Programming the IBM Power3 SP

OpenMP

1. All OpenMP programs begin as a single process: the master thread

2. FORK: the master thread creates a team of parallel threads

3. Parallel region statements are executed in parallel by the team threads

4. JOIN: threads synchronize and terminate, leaving only the master thread
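A minimal fork/join sketch (not from the slides; with IBM XL Fortran such code is typically compiled with xlf_r and -qsmp=omp): the master thread forks a team at the PARALLEL directive, every thread runs the region, and the team joins at END PARALLEL:

      program forkjoin
      implicit none
      integer omp_get_thread_num, omp_get_num_threads
      integer tid, nthreads

      print *, 'master thread before the parallel region'
!$OMP PARALLEL PRIVATE(tid, nthreads)
      ! FORK: this block is executed by every thread in the team
      tid = omp_get_thread_num()
      nthreads = omp_get_num_threads()
      print *, 'hello from thread ', tid, ' of ', nthreads
!$OMP END PARALLEL
      ! JOIN: only the master thread continues from here
      print *, 'master thread after the parallel region'
      end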

Page 59: Programming the IBM Power3 SP

OpenMP

How is OpenMP typically used?

• OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops.
  – Split them up between threads.

• Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!

Page 60: Programming the IBM Power3 SP

OpenMP Loop Parallelization

!$OMP PARALLEL DO
      do i = 0, ilong
         do k = 1, kshort
            ...
         end do
      end do

#pragma omp parallel for
for (i = 0; i <= ilong; i++)
    for (k = 1; k <= kshort; k++) {
        ...
    }

Page 61: Programming the IBM Power3 SP

Variable Scoping

• Most difficult part of shared memory parallelization:
  – What memory is shared
  – What memory is private (each processor has its own copy)

• Compare MPI: all variables are private

• Variables are shared by default, except:
  – loop indices
  – scalars that are set and then used in the loop
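A small sketch of explicit scoping clauses (variable names are illustrative only, not from the slides): the loop index and a scalar that is set and then used must be private, while the arrays stay shared:

!$OMP PARALLEL DO PRIVATE(i, tmp) SHARED(a, b, n)
      do i = 1, n
         tmp = 2.0d0 * b(i)     ! set then used: must be private
         a(i) = tmp + b(i)      ! a and b are safely shared (disjoint i)
      end do
!$OMP END PARALLEL DO

Declaring the scoping explicitly avoids relying on the default rules.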

Page 62: Programming the IBM Power3 SP

How Does Sharing Work?

THREAD 1:                        THREAD 2:
increment(x)                     increment(x)
{                                {
    x = x + 1;                       x = x + 1;
}                                }

THREAD 1:                        THREAD 2:
10 LOAD  A, (x address)          10 LOAD  A, (x address)
20 ADD   A, 1                    20 ADD   A, 1
30 STORE A, (x address)          30 STORE A, (x address)

Shared x is initially 0. Depending on how the two threads' loads and stores interleave, the result could be 1 or 2 - synchronization is needed.
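The race above can be removed with an OpenMP synchronization directive. A minimal sketch (not from the slides) using ATOMIC, which makes each thread's read-modify-write of x indivisible:

      program race
      implicit none
      integer x
      x = 0
!$OMP PARALLEL
!$OMP ATOMIC
      x = x + 1            ! each thread's update is now indivisible
!$OMP END PARALLEL
      print *, 'x = ', x   ! equals the number of threads
      end

A REDUCTION clause on a parallel loop achieves the same effect for accumulations without naming the synchronization explicitly.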

Page 63: Programming the IBM Power3 SP

False Sharing

[Diagram: a single cache line holding array elements 0-7, with Processor 1 and Processor 2 each caching a copy of the block (address tag + data).]

!$OMP PARALLEL DO
      do I = 1, 20
         A(I) = ...
      end do

Say A(1:5), assigned to the first thread, starts on a cache line; then some of A(6:10), assigned to the second thread, falls on that same line. Both threads write to the one line, so it bounces between the two processors' caches and the second thread's updates stall until the first thread gives the line up.
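One common mitigation (a sketch, not from the slides) is to give each thread chunks that span whole cache lines; with 128-byte lines and real*8 data, a chunk of 16 iterations means two threads rarely write into the same line (assuming the array start is reasonably aligned):

!$OMP PARALLEL DO SCHEDULE(STATIC, 16)
      do I = 1, N
         A(I) = 2.0d0 * A(I)
      end do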

Page 64: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 65: Programming the IBM Power3 SP

Why Hybrid MPI-OpenMP?

• To optimize performance on “mixed-mode” hardware like the SP

• MPI is used for "inter-node" communication, and OpenMP is used for "intra-node" communication
  – threads have lower latency
  – threads can alleviate the network contention of a pure MPI implementation

Page 66: Programming the IBM Power3 SP

Hybrid MPI-OpenMP?

• Unless you are forced against your will, for the hybrid model to be worthwhile:
  – There has to be obvious parallelism to exploit
  – The code has to be easy to program and maintain (it is easy to write bad OpenMP code)
  – It has to promise to perform at least as well as the equivalent all-MPI program

• Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
  – especially true of applications with a single level of parallelism

Page 67: Programming the IBM Power3 SP

Hybrid Scenario

• Thread the computational portions of the code that exist between MPI calls

• MPI calls are "single-threaded" and therefore use only a single CPU

• Assumes:
  – the application has two natural levels of parallelism
  – or that, in breaking up an MPI code with one level of parallelism, there is little or no communication between the resulting threads

A minimal skeleton of this pattern is sketched below.
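The skeleton (a sketch under these assumptions, not the talk's code): one MPI process per node, OpenMP threads over the computation between MPI calls, and MPI called only from the single master thread:

      program hybrid
      implicit none
      include "mpif.h"
      integer n
      parameter ( n = 100000 )
      real*8 a(n), partial, total
      integer my_id, ierr, i

      call MPI_INIT ( ierr )                 ! single-threaded MPI
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      a = my_id + 1.0d0

      ! computational section between MPI calls is threaded with OpenMP
      partial = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:partial)
      do i = 1, n
         partial = partial + a(i) * a(i)
      end do

      ! back to a single thread for the MPI call
      call MPI_REDUCE ( partial, total, 1, MPI_DOUBLE_PRECISION,
     &                  MPI_SUM, 0, MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' total = ', total
      call MPI_FINALIZE ( ierr )
      end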

Page 68: Programming the IBM Power3 SP

Programming the IBM Power3 SP

• History and future of POWER chip

• Uni-processor optimization

• Description of ACRL’s IBM SP

• Parallel Processing
  – MPI
  – OpenMP

• Hybrid MPI/OpenMP

• MPI-I/O (one slide)

Page 69: Programming the IBM Power3 SP

MPI-IO

• Part of MPI-2

• Resulted from work at IBM Research exploring the analogy between I/O and message passing

• See "Using MPI-2" by Gropp et al. (MIT Press)

[Diagram: processes, their data in memory, and a shared file.]
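A minimal sketch of the idea (file name and layout are illustrative only, not from the talk): each process writes its own block of one shared file at an offset computed from its rank:

      program mpiio
      implicit none
      include "mpif.h"
      integer n
      parameter ( n = 100 )
      real*8 a(n)
      integer my_id, fh, ierr
      integer (kind=MPI_OFFSET_KIND) offset

      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      a = my_id

      ! all processes open the same file
      call MPI_FILE_OPEN ( MPI_COMM_WORLD, 'out.dat',
     &                     MPI_MODE_WRONLY + MPI_MODE_CREATE,
     &                     MPI_INFO_NULL, fh, ierr )

      ! each process writes its block at a rank-dependent byte offset
      offset = my_id * n * 8
      call MPI_FILE_WRITE_AT ( fh, offset, a, n,
     &                         MPI_DOUBLE_PRECISION,
     &                         MPI_STATUS_IGNORE, ierr )

      call MPI_FILE_CLOSE ( fh, ierr )
      call MPI_FINALIZE ( ierr )
      end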

Page 70: Programming the IBM Power3 SP

Conclusion

• Don't forget uni-processor optimization

• If you choose one parallel programming API, choose MPI

• Mixed MPI-OpenMP may be appropriate in certain cases
  – More work is needed here

• The remote memory access model may be the answer