Performance Characteristics of a Cosmology Package on Leading HPC Architectures

Leonid Oliker, Julian Borrill, Jonathan Carter

Lawrence Berkeley National Laboratory

http://crd.lbl.gov/~oliker

Overview

Superscalar cache-based architectures dominate HPC market

Leading architectures are commodity-based SMPs due to generality and perception of cost effectiveness

Growing gap between peak & sustained performance is well known in scientific computing

Modern parallel vector systems may bridge this gap for many important applications

In April 2002, the Earth Simulator (ES) became operational:
• Peak ES performance > all DOE and DOD systems combined
• Demonstrated high sustained performance on demanding scientific apps

Conducting evaluation study of scientific applications on modern vector systems
• 09/2003: MOU between ES and NERSC was completed
• First visit to ES center: Dec 2003, second visit Oct 2004 (no remote access)
• First international team to conduct performance evaluation study at ES

Examining best mapping between demanding applications and leading HPC systems - one size does not fit all

Vector Paradigm

High memory bandwidth
• Allows systems to effectively feed ALUs (high byte to flop ratio)

Flexible memory addressing modes
• Supports fine grained strided and irregular data access

Vector Registers
• Hide memory latency via deep pipelining of memory load/stores

Vector ISA
• Single instruction specifies large number of identical operations

Vector architectures allow for:
• Reduced control complexity
• Efficiently utilize large number of computational resources
• Potential for automatic discovery of parallelism

However: most effective if sufficient regularity discoverable in program structure
• Suffers even if small % of code non-vectorizable (Amdahl’s Law)
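The Amdahl's Law point above can be made concrete with a small worked example. This is only a sketch: the function name and the vector speedup of 8x are illustrative assumptions, not figures from the slides.

```python
def amdahl_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Overall speedup when only `vector_fraction` of the runtime is accelerated."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    scalar_fraction = 1.0 - vector_fraction
    return 1.0 / (scalar_fraction + vector_fraction / vector_speedup)


# Hypothetical 8x gain on the vectorizable part of the code.
for fraction in (1.00, 0.99, 0.95, 0.90):
    overall = amdahl_speedup(fraction, 8.0)
    print(f"{fraction:.0%} vectorizable -> {overall:.2f}x overall")
```

Under these assumptions, leaving 10% of the runtime scalar already loses more than 40% of the ideal gain, which is why a small non-vectorizable fraction matters so much.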

Architectural Comparison

Node    Where  CPU/  Clock  Peak   Mem BW  Peak       Netwk BW  Bisect BW  MPI Latency  Network
Type           Node  MHz    GFlop  GB/s    byte/flop  GB/s/P    byte/flop  usec         Topology

Power3  NERSC  16    375    1.5    1.0     0.47       0.13      0.087      16.3         Fat-tree
Power4  ORNL   32    1300   5.2    2.3     0.44       0.13      0.025      7.0          Fat-tree
Altix   ORNL   2     1500   6.0    6.4     1.1        0.40      0.067      2.8          Fat-tree
ES      ESC    8     500    8.0    32.0    4.0        1.5       0.19       5.6          Crossbar
X1      ORNL   4     800    12.8   34.1    2.7        6.3       0.088      7.3          2D-torus

Custom vector architectures have
• High memory bandwidth relative to peak
• Superior interconnect: latency, point to point, and bisection bandwidth
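The "peak byte/flop" balance is memory bandwidth divided by peak floating-point rate. The sketch below recomputes it for four of the listed systems from the table's own peak and bandwidth figures; the dictionary layout and function name are illustrative choices.

```python
# (peak GFlop/s per CPU, memory bandwidth GB/s per CPU), as listed in the table
SYSTEMS = {
    "Power4": (5.2, 2.3),
    "Altix": (6.0, 6.4),
    "ES": (8.0, 32.0),
    "X1": (12.8, 34.1),
}


def bytes_per_flop(peak_gflops: float, mem_bw_gbs: float) -> float:
    """Memory balance: bytes deliverable from memory per peak flop."""
    return mem_bw_gbs / peak_gflops


balance = {name: bytes_per_flop(*spec) for name, spec in SYSTEMS.items()}
for name, value in balance.items():
    print(f"{name:7s} {value:.2f} byte/flop")
```

On these figures the two vector systems (ES, X1) sit well above the superscalar ones, which is the balance the bullet above refers to.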

Another key balance point is I/O performance:

Seaborg I/O: 16 GPFS servers, each w/ 32 GB main memory (for caching & metadata)
• I/O uses switch fabric, sharing bandwidth with message-passing traffic

ES I/O: Each group of 16 nodes has a pool of RAID disks attached via a Fibre Channel switch (each node has a separate filesystem)

Previous ES visit

Tremendous potential of vector architectures: 4 codes running faster than ever before

Vector systems allow resolution not possible with scalar architectures (regardless of # procs)

Opportunity to perform scientific runs at unprecedented scale
• Evaluation codes contain sufficient regularity in computation for high vector performance
• However, none of the tested codes contained significant I/O requirements

          % peak (P=64)                    Speedup ES vs. (P=Max avail)
Code      Pwr3  Pwr4  Altix  ES    X1      Pwr3  Pwr4  Altix  X1

LBMHD     7%    5%    11%    58%   37%     30.6  15.3  7.2    1.5
CACTUS    6%    11%   7%     34%   6%      45.0  5.1   6.4    4.0
GTC       9%    6%    5%     20%   11%     9.4   4.3   4.1    1.1
PARATEC   57%   33%   54%    58%   20%     8.2   3.9   1.4    3.9
Average                                    23.3  7.2   4.8    2.6
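The "Average" row is consistent with a plain arithmetic mean of the four per-code speedups. A short sketch that recomputes it from the table values (the data layout and function name are illustrative):

```python
# Speedup of ES over each system at the largest available P, from the table.
SPEEDUP_ES_VS = {
    "LBMHD":   {"Pwr3": 30.6, "Pwr4": 15.3, "Altix": 7.2, "X1": 1.5},
    "CACTUS":  {"Pwr3": 45.0, "Pwr4": 5.1,  "Altix": 6.4, "X1": 4.0},
    "GTC":     {"Pwr3": 9.4,  "Pwr4": 4.3,  "Altix": 4.1, "X1": 1.1},
    "PARATEC": {"Pwr3": 8.2,  "Pwr4": 3.9,  "Altix": 1.4, "X1": 3.9},
}


def mean_speedup(table: dict, system: str) -> float:
    """Arithmetic mean of the ES speedup over `system` across all codes."""
    values = [row[system] for row in table.values()]
    return sum(values) / len(values)


averages = {s: mean_speedup(SPEEDUP_ES_VS, s) for s in ("Pwr3", "Pwr4", "Altix", "X1")}
for system, avg in averages.items():
    print(f"ES vs. {system}: {avg:.2f}x on average")
```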

The Cosmic Microwave Background

The CMB is a snapshot of the Universe when it first became neutral 400,000 years after the Big Bang.

After the Big Bang, the expansion of space cooled the Universe sufficiently for charged electrons and protons to combine into neutral atoms

Cosmic - primordial photons filling all of space.

Microwave - redshifted by the expansion of the Universe from 3000K to 3K.

Background - coming from “behind” all astrophysical sources.

CMB Science

The CMB is a unique probe of the very early Universe.

Tiny fluctuations in its temperature & polarization encode
- the fundamental parameters of cosmology
• Universe geometry, expansion rate, number of neutrino species, ionization history, dark matter, cosmological constant
- ultra-high energy physics beyond the Standard Model

CMB analysis moves
- from the time domain - observations - O(10^12)
- to the pixel domain - maps - O(10^8)
- to the multipole domain - power spectra - O(10^4)
calculating the compressed data and their reduced error bars at each step.

CMB Data Analysis

MADCAP: Performance

Porting: ScaLAPACK plus rewrite of Legendre polynomial recursion, such that large batches are computed in inner loop

Original ES visit: only partially ported due to code’s requirements of global file system

Could not meet minimum parallelization and vectorization thresholds for ES

All systems sustain relatively low % peak considering MADCAP’s BLAS3 ops

Detailed analysis presented at HiPC 2004

Further work performed for MADbench to: reduce I/O, remove system calls, and remove global file system requirements

New results collected from recent ES visit October 2004

      Power3            Power4            ES                X1
P     Gflops/P  %peak   Gflops/P  %peak   Gflops/P  %peak   Gflops/P  %peak

16    0.62      41%     1.5       29%     4.1       32%     2.2       27%
64    0.54      36%     0.81      16%     1.9       23%     2.0       16%
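The porting note on this slide mentions rewriting the Legendre polynomial recursion so that large batches are computed in the inner loop. The MADCAP kernel itself is not shown in the slides, so the following is only a sketch of that loop ordering, using the standard three-term recurrence (l+1) P_{l+1}(x) = (2l+1) x P_l(x) - l P_{l-1}(x). The function name is hypothetical, and pure Python is used to show structure, not performance.

```python
def legendre_batch(lmax: int, xs: list) -> list:
    """Return a table p with p[l][k] = P_l(xs[k]) for l = 0..lmax.

    The outer loop walks the (sequential) recurrence in l; the inner loop
    runs over the whole batch of arguments, which is long, independent and
    unit-stride: the shape a vectorizing compiler can exploit.
    """
    n = len(xs)
    p = [[1.0] * n]                      # P_0(x) = 1
    if lmax >= 1:
        p.append([float(x) for x in xs])  # P_1(x) = x
    for l in range(1, lmax):
        prev, cur = p[l - 1], p[l]
        a = (2 * l + 1) / (l + 1)
        b = l / (l + 1)
        p.append([a * x * c - b * q for x, c, q in zip(xs, cur, prev)])
    return p


table = legendre_batch(3, [0.5, 1.0, -1.0])
print(table[2][0], table[3][0])
```

With the loops the other way round (one argument at a time, recurrence innermost), each inner iteration depends on the previous one and cannot be vectorized.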

IPM Overview

Integrated Performance Monitoring

portable, lightweight, scalable profiling

fast hash method

profiles MPI topology

profiles code regions

open source

MPI_Pcontrol(1, "W");
… code …
MPI_Pcontrol(-1, "W");

############################################
# IPMv0.7 :: csnode041 256 tasks ES/ESOS
# madbench.x (completed) 10/27/04/14:45:56
#
# <mpi> <user> <wall> (sec)
# 171.67 352.16 393.80
# …
############################################
# W
# <mpi> <user> <wall> (sec)
# 36.40 198.00 198.36
#
# call [time] %mpi %wall
# MPI_Reduce 2.395e+01 65.8 6.1
# MPI_Recv 9.625e+00 26.4 2.4
# MPI_Send 2.708e+00 7.4 0.7
# MPI_Testall 7.310e-02 0.2 0.0
# MPI_Isend 2.597e-02 0.1 0.0
############################################
…
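The percentage columns in the report can be reproduced from the listed times. The sketch below assumes (this is inferred from the numbers, not stated in the report) that %mpi is relative to the MPI time of region W and %wall is relative to the wall time of the whole run; the variable names are illustrative.

```python
WALL_TOTAL = 393.80    # <wall> seconds for the whole run (report header)
MPI_REGION_W = 36.40   # <mpi> seconds inside region "W"

# Per-call times in seconds, copied from the region "W" listing.
CALLS = {
    "MPI_Reduce": 2.395e+01,
    "MPI_Recv": 9.625e+00,
    "MPI_Send": 2.708e+00,
    "MPI_Testall": 7.310e-02,
    "MPI_Isend": 2.597e-02,
}


def percentages(call_time: float) -> tuple:
    """Return (%mpi, %wall) under the assumptions stated above."""
    return 100.0 * call_time / MPI_REGION_W, 100.0 * call_time / WALL_TOTAL


shares = {name: percentages(t) for name, t in CALLS.items()}
for name, (pct_mpi, pct_wall) in shares.items():
    print(f"{name:12s} {pct_mpi:5.1f} {pct_wall:5.1f}")
```

Read this way, MPI_Reduce accounts for about two thirds of the MPI time in W but only about 6% of the total wall time.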

MADbench

Is a lightweight version of the MADCAP maximum likelihood CMB power spectrum estimation code.

Retains the operational complexity & integrated system requirements of the full science code.

Has three basic steps - dSdC, invD & W.

Out of core calculation: holds approx 3 of the 50 matrices in memory

Is used for
- computer & file-system procurements.
- realistic scientific code benchmarking and optimization.
- architectural comparisons.

dSdC

This step generates a set of Nb dense, symmetric Np x Np signal correlation derivative matrices dSdC_b by Legendre polynomial recursion.

Each matrix is block-cyclic distributed over the 2D processor array with blocksize B.

As each matrix is calculated, each processor writes its subset of the matrix elements to a unique file.

No inter-processor communication is required.

Flops: O(Np^2)    Disk: 8 Nb Np^2 (primarily writing)
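The block-cyclic distribution used in this step can be sketched as an index mapping. The code below follows the usual ScaLAPACK-style convention with the first block on process 0 (an assumption; the slides do not give the grid origin), and the function names and the small 8 x 8 example are hypothetical.

```python
from collections import Counter


def owner_and_local(g: int, block: int, nprocs: int) -> tuple:
    """Map a 0-based global index to (owning process, local index) in one dimension."""
    blk = g // block                       # which block the index falls in
    return blk % nprocs, (blk // nprocs) * block + g % block


def owner_2d(i: int, j: int, block: int, prow: int, pcol: int) -> tuple:
    """Process-grid coordinates that own matrix element (i, j)."""
    return owner_and_local(i, block, prow)[0], owner_and_local(j, block, pcol)[0]


# Hypothetical example: 8 x 8 matrix, blocksize B = 2, 2 x 2 processor array.
counts = Counter(owner_2d(i, j, 2, 2, 2) for i in range(8) for j in range(8))
print(dict(counts))
```

Because each process can work out which elements it owns from the indices alone, every process can generate and write its own subset without any communication, as the slide states.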

invD

This step generates the data correlation matrix D and inverts it.

The dSdC_b matrices are read from disk one at a time and progressively accumulated to build the signal correlation matrix S.

A diagonal white noise correlation matrix N is added to S to give the data correlation matrix D, which is inverted using ScaLAPACK to give D^-1.

Each processor writes its subset of the D^-1 matrix elements to a unique file.

Disk: primarily reading

W

This step multiplies each dSdC_b matrix by D^-1 to form W_b and derives a Newton-Raphson iterative step from this.

Since they are independent, these matrix multiplications can be carried out gang-parallel across Ng gangs of processors.

Each dSdC_b matrix is read in by all processors and then redistributed to the target gang.

When all gangs have been given a matrix, they all perform their multiplication simultaneously.

Disk: primarily reading
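The gang-parallel pattern described above amounts to handing out the Nb matrices in rounds of Ng. The sketch below shows only that bookkeeping; the function names are hypothetical, the even split of P processors across gangs is an assumption, and the matrix count of 32 is an illustrative value.

```python
def gang_schedule(nb: int, ng: int) -> list:
    """Return rounds of (matrix index, gang index) pairs.

    In each round every gang receives one matrix, then all gangs
    multiply simultaneously.
    """
    if nb % ng != 0:
        raise ValueError("Nb should be a multiple of Ng to load-balance the gangs")
    return [[(rnd * ng + g, g) for g in range(ng)] for rnd in range(nb // ng)]


def procs_per_gang(p: int, ng: int) -> int:
    """Processors in each gang, assuming P is split evenly across gangs."""
    if p % ng != 0:
        raise ValueError("P should be divisible by Ng")
    return p // ng


rounds = gang_schedule(nb=32, ng=16)   # hypothetical Nb = 32
print(len(rounds), "rounds;", procs_per_gang(1024, 16), "processors per gang")
```

With a single gang every multiplication spans all P processors; with 16 gangs each multiplication involves fewer processors, which fits the reduced MPI overhead reported for multi-gang runs.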

Parameters

Np - number of pixels (matrix size).
Nb - number of bins (matrix count).
Ng - number of gangs of processors.
B - ScaLAPACK blocksize.
MODIO - IO concurrency control (only 1 in MODIO processors do IO simultaneously).

Running on P processors requires:
- 3 x 8 x Np^2 bytes of memory per gang
- Nb x 8 x Np^2 bytes & Nb x P inodes of disk
- Nb a multiple of Ng to load-balance the gangs.

B & MODIO are architecture-specific optimizations.
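The resource requirements listed above are simple to evaluate for a planned run. A minimal sketch, with a hypothetical function name and illustrative parameter values (Np = 10000, Nb = 16, Ng = 4, P = 64 are not taken from the slides):

```python
def madbench_requirements(np_pix: int, nb: int, ng: int, p: int) -> dict:
    """Evaluate the memory, disk and inode formulas from the Parameters slide."""
    matrix_bytes = 8 * np_pix ** 2          # one dense Np x Np matrix of 8-byte values
    return {
        "memory_bytes_per_gang": 3 * matrix_bytes,
        "disk_bytes": nb * matrix_bytes,
        "inodes": nb * p,                   # one file per matrix per processor
        "gangs_balanced": nb % ng == 0,
    }


req = madbench_requirements(np_pix=10_000, nb=16, ng=4, p=64)
print(f"memory per gang: {req['memory_bytes_per_gang'] / 1e9:.1f} GB")
print(f"disk:            {req['disk_bytes'] / 1e9:.1f} GB in {req['inodes']} files")
print(f"gangs balanced:  {req['gangs_balanced']}")
```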

dSdC performance

ES shows constant I/O performance (independent disks)

Significantly faster computation (30X) due to high memory bandwidth

Overall only 2.6X faster than Power3 due to I/O overhead

Power3 has faster write I/O until GPFS contention at P=1024

[Figure: dSdC run time in seconds for Pwr3 and ES at P=16, 64, 256, and 1024, split into CALC, MPI, and I/O]

invD performance

I/O remains relatively constant, while MPI overhead and computation grow

Seaborg I/O reads faster than ES

Overall ES only 2.3X faster

[Figure: invD run time in seconds at P=16, 64, 256, and 1024, split into CALC, MPI, and I/O]

W performance

Multi-gang runs significantly reduce MPI overhead (4.8X on ES, 3.3X on Seaborg)

MPI and CALC grow with number of processors

I/O trivial part of W calculation

Overall ES is 7X faster

[Figure: W run time in seconds for Pwr3 and ES, split into CALC, MPI, and I/O: 16 gangs (G16) at P=16, 64, and 256; 1 gang (G1) and 16 gangs (G16) at P=1024]

Performance overview

Overall ES 5.6X faster & slightly higher % of peak compared w/ Seaborg for P=1024

For P=256 Seaborg shows higher % peak, due to relative I/O vs. peak flop performance

Although I/O cost remains relatively high, both systems achieve over 50% peak

[Figure: Overview at P=256 and P=1024, on a 0.0 to 1.0 scale: ES runtimes normalized to Power3 (CALC, MPI, I/O), with %Pk (no I/O) and %Pk (total)]
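The statement that ES is 5.6X faster at P=1024 with only a slightly higher % of peak follows from the ratio of per-CPU peaks in the architecture table (8.0 GFlop for ES, 1.5 GFlop for the Power3 system, Seaborg). A sketch of that arithmetic, assuming both systems perform the same work on the same number of processors; the function name is illustrative.

```python
PEAK_GFLOPS = {"Power3": 1.5, "ES": 8.0}   # per-CPU peak, from the architecture table


def relative_percent_of_peak(speedup: float, peak_fast: float, peak_slow: float) -> float:
    """Ratio of %-of-peak (fast system over slow system) at equal P and equal work."""
    return speedup / (peak_fast / peak_slow)


peak_ratio = PEAK_GFLOPS["ES"] / PEAK_GFLOPS["Power3"]
ratio = relative_percent_of_peak(5.6, PEAK_GFLOPS["ES"], PEAK_GFLOPS["Power3"])
print(f"peak ratio {peak_ratio:.2f}x, relative % of peak {ratio:.2f}")
```

A runtime advantage only a little above the roughly 5.3x peak ratio means the two systems sustain nearly the same fraction of peak, with ES slightly ahead.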

Overview

New version of MADbench successfully reduced I/O overhead and removed global file system requirements

• Allowed ES runs up to 1024 processors, achieving over 50% of peak
• Compared with only 23% of peak on 64 processors from first visit

Results show that I/O has more effect on ES than Seaborg - due to ratio between I/O performance and peak ALU speed

Demonstrated IPM capabilities to measure MPI overhead on variety of architectures without the need to recompile, at a trivial runtime overhead (1-2%)

Continue study of complex interplay between architecture, interconnect, and I/O

Currently performing experiments on Columbia and Phoenix

MADbench and IPM being prepared for public distribution

Future CMB analysis will require sparse methods due to size of data sets - potentially at odds with vector architectures
