Benchmark performance on Bassi
Jonathan Carter, User Services Group Lead (jtcarter@lbl.gov)
NERSC User Group Meeting, June 12, 2006





1

Benchmark performance on Bassi

Jonathan Carter, User Services Group Lead
jtcarter@lbl.gov

NERSC User Group Meeting, June 12, 2006


2

Architectural Comparison

Node Type  Where  Network     CPU/  Clock  Peak       Stream BW  Peak         MPI BW    MPI Latency  Network
                              Node  (MHz)  (GFlop/s)  (GB/s/P)   (byte/flop)  (GB/s/P)  (usec)       Topology

Power3     NERSC  Colony      16     375    1.5        0.4        0.26         0.13      16.3        Fat-tree
Itanium2   LLNL   Quadrics     4    1400    5.6        1.1        0.19         0.25       3.0        Fat-tree
Opteron    NERSC  InfiniBand   2    2200    4.4        2.3        0.51         0.59       6.0        Fat-tree
Power5     NERSC  HPS          8    1900    7.6        6.8        0.85         0.69       4.7        Fat-tree
X1E        ORNL   Custom       4    1130   18.0        9.7        0.54         2.9        5.0        4D-Hypercube
ES         ESC    IN           8    1000    8.0       26.3        3.29         1.5        5.6        Crossbar
SX-8       HLRS   INX          8    2000   16.0       41.0        2.56         2.0        5.0        Crossbar


3

NERSC 5 Application Benchmarks

• CAM3 – Climate model, NCAR

• GAMESS – Computational chemistry, Iowa State, Ames Lab

• GTC – Fusion, PPPL

• MADbench – Astrophysics (CMB analysis), LBL

• MILC – QCD, multi-site collaboration

• PARATEC – Materials science, developed at LBL and UC Berkeley

• PMEMD – Computational chemistry, University of North Carolina-Chapel Hill


4

Application Summary

Application  Science Area              Basic Algorithm            Language    Library Use  Comment

CAM3         Climate (BER)             CFD, FFT                   FORTRAN 90  netCDF       IPCC
GAMESS       Chemistry (BES)           DFT                        FORTRAN 90  DDI, BLAS
GTC          Fusion (FES)              Particle-in-cell           FORTRAN 90  FFT (opt)    ITER emphasis
MADbench     Astrophysics (HEP & NP)   Power spectrum estimation  C           ScaLAPACK    1024 proc., 730 MB per task, 200 GB disk
MILC         QCD (NP)                  Conjugate gradient         C           none         2048 proc., 540 MB per task
PARATEC      Materials (BES)           3D FFT                     FORTRAN 90  ScaLAPACK    Nanoscience emphasis
PMEMD        Life Science (BER)        Particle Mesh Ewald        FORTRAN 90  none


5

CAM3

• Community Atmospheric Model version 3
  – Developed at NCAR with substantial DOE input, both scientific and software

• The atmosphere model for CCSM, the coupled climate system model
  – Also the most time-consuming part of CCSM
  – Widely used by both American and foreign scientists for climate research
    • For example, carbon and bio-geochemistry models are built upon (integrated with) CAM3
    • IPCC predictions use CAM3 (in part)
  – About 230,000 lines of Fortran 90

• 1D decomposition runs on up to 128 processors at T85 resolution (150 km)

• 2D decomposition runs on up to 1680 processors at 0.5 deg (60 km) resolution


6

CAM3: Performance

   P    Power3       Itanium2     Opteron      Power5
        Seaborg      Thunder      Jacquard     Bassi
        GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk

  56    0.22   15%   0.35    6%     —     —    0.93   12%
 240    0.18   13%   0.38    6%     —     —    0.83   11%


7

GAMESS

• Computational chemistry application
  – Variety of electronic structure algorithms available

• About 550,000 lines of Fortran 90

• Communication layer makes use of highly optimized vendor libraries

• Many methods available within the code
  – Benchmarks are DFT energy and gradient calculation, and MP2 energy and gradient calculation
  – Many computational chemistry studies rely on these techniques

• Exactly the same as the DOD HPCMP TI-06 GAMESS benchmark
  – Vendors will only have to do the work once


8

GAMESS: Performance

   P    Power3       Itanium2     Opteron      Power5
        Seaborg      Thunder      Jacquard     Bassi
        GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk

  64    0.02    1%   0.07    1%   0.07    2%   0.06    1%
 384    0.03    2%   0.32    5%     —     —    0.31    4%

• Small case: large, messy, low-computational-intensity kernels are problematic for compilers

• Large case depends on asynchronous messaging


9

GTC

• Gyrokinetic Toroidal Code

• Important code for Fusion SciDAC Project and for the International Fusion collaboration ITER.

• Transport of thermal energy via plasma microturbulence using the particle-in-cell (PIC) approach

[Figure: 3D visualization of electrostatic potential in a magnetic fusion device]


10

GTC: Performance

   P    Power3       Itanium2     Opteron      Power5       X1E          SX6          SX8
        Seaborg      Thunder      Jacquard     Bassi        Phoenix      ES           HLRS
        GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk

  64    0.15   10%   0.51    9%   0.64   15%   0.72    9%   1.7    10%   1.9    23%   2.3    14%
 256    0.13    8%   0.44    7%   0.58   13%   0.68    9%   1.7    10%   1.8    22%   2.3    15%

• SX8 achieves the highest raw performance (ever) but lower efficiency than ES

• Scalar architectures suffer from low computational intensity, irregular data access, and register spilling

• Opteron/IB is 50% faster than Itanium2/Quadrics and only 1/2 the speed of the X1
  – Opteron: on-chip memory controller and caching of FP data in L1

• X1 suffers from overhead of scalar code portions


11

MADbench

• Cosmic microwave background radiation analysis tool (MADCAP)
  – Used a large amount of time in FY04; one of the highest-scaling codes at NERSC

• MADbench is a benchmark version of the original code
  – Designed to be easily run with synthetic data for portability
  – Used in a recent study in conjunction with the Berkeley Institute for Performance Studies (BIPS)

• Written in C, making extensive use of the ScaLAPACK library

• Has extensive I/O requirements


12

MADbench: Performance

   P    Power3       Itanium2     Opteron      Power5
        Seaborg      Thunder      Jacquard     Bassi
        GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk

  64    0.56   37%   2.6    43%   1.7    40%   4.1    54%
 256    0.50   34%   2.2    36%   1.8    40%   3.2    44%
2048    0.70   47%   1.6    27%     —     —     —     —

• Dominated by:
  – BLAS3
  – I/O


13

MILC

• Quantum chromodynamics application
  – Widespread community use, large allocation
  – Easy to build, no dependencies, standards-conforming
  – Can be set up to run at a wide range of concurrencies

• Conjugate gradient algorithm

• Physics on a 4D lattice

• Local computations are 3x3 complex matrix multiplies, with a sparse (indirect) access pattern


14

MILC: Performance

   P    Power3       Itanium2     Opteron      Power5
        Seaborg      Thunder      Jacquard     Bassi
        GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk

  64    0.18   12%   0.26    4%   0.60   14%   1.35   18%
 256    0.14    9%   0.26    4%   0.51   12%   0.86   11%
2048    0.12    8%   0.25    4%   0.47   11%     —     —


15

PARATEC

• Parallel Total Energy Code

• Plane-wave DFT using a custom 3D FFT

• 70% of materials science computation at NERSC is done via plane-wave DFT codes. PARATEC captures the performance of a wide range of codes (VASP, CPMD, PETOT).


16

PARATEC: Performance

   P    Power3       Itanium2     Opteron      Power5       X1E          SX6          SX8
        Seaborg      Thunder      Jacquard     Bassi        Phoenix      ES           HLRS
        GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk

  64    0.60   40%   1.8    29%   2.3    53%   4.4    58%   3.8    21%   5.1    64%   7.5    49%
 256    0.41   27%   0.79   13%   1.7    38%   3.3    43%   3.3    18%   5.0    62%   6.8    43%

• All architectures generally perform well due to the computational intensity of the code (BLAS3, FFT)

• SX8 achieves the highest per-processor performance

• X1/X1E shows the lowest % of peak
  – Non-vectorizable code much more expensive on X1/X1E (32:1)
  – Lower bisection-bandwidth-to-computation ratio (4D hypercube)
  – X1 performance is comparable to Itanium2

• Itanium2 outperforms Opteron because
  – PARATEC is less sensitive to memory access issues (BLAS3)
  – Opteron lacks an FMA unit
  – Quadrics shows better scaling of all-to-all at large concurrencies


17

PMEMD

• Particle Mesh Ewald Molecular Dynamics
  – An F90 code with advanced MPI coding; it should test the compiler and stress asynchronous point-to-point messaging

• PMEMD is very similar to the MD engine in AMBER 8.0, used in both chemistry and biosciences

• Test system is a 91K-atom blood coagulation protein


18

PMEMD: Performance

   P    Power3       Itanium2     Opteron      Power5
        Seaborg      Thunder      Jacquard     Bassi
        GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk

  64    0.13    9%   0.21    3%   0.46   10%   0.52    7%
 256    0.05    3%   0.10    2%   0.19    4%   0.32    4%


19

Summary

[Bar chart: benchmark times for MILC M/L/XL, GTC M/L, PARA M/L, GAM M/L, MAD M/L/XL, PME M/L, and CAM M/L on seaborg, bassi, jacquard, and thunder; y-axis 0 to 10000. The underlying numbers are tabulated on the following slide.]


20

Summary

           seaborg   bassi    jacquard  thunder   s/b

MILC M      1028.9    138.0     312.0     708.0   7.5
MILC L      9562.7   1496.0    2530.0    5069.0   6.4
MILC XL    12697.3   1945.0    3289.0    6129.0    —
GTC M       8236.9   1667.0    1876.0    2345.0   4.9
GTC L       9572.3   1790.0    2079.0    2759.0   5.3
PARA M      3306.4    451.0     861.0    1134.0   7.3
PARA L      6811.0    854.0    1654.0    3534.0   8.0
GAM M      18665.0   5837.0    5404.0    5277.0   3.2
GAM L      42167.0   4683.0      —       4516.0   9.0
MAD M       8013.9   1094.0    2585.0    1727.0   7.3
MAD L       8421.6   1277.0    2417.0    1942.0   6.6
MAD XL      2943.9    447.0     846.0    1291.0    —
PME M       2080.0    538.0     606.0    1344.0   3.9
PME L       3020.0    475.0     782.0    1541.0   6.4
CAM M       7932.8   1886.0      —       4988.0   4.2
CAM L       2439.0    527.0      —       1158.0   4.6


21

Summary

• The average seaborg-to-bassi runtime ratio (s/b) is 6.0 across the N5 application benchmarks