Evaluation of Modern Parallel Vector Architectures Leonid Oliker Future Technologies Group Computational Research Division LBNL oliker

Evaluation of Modern Parallel Vector Architectures

Leonid Oliker

Future Technologies Group

Computational Research Division

LBNL

www.nersc.gov/~oliker

Previous Research

Examined complex interactions between high-level algorithms, leading programming paradigms, and modern architectural platforms

Evaluated three parallelization strategies of a dynamic unstructured mesh adaptation algorithm

Examined two major classes of adaptive applications (UMA and N-Body) under three parallel programming models

Investigated effects of algorithmic orderings on sparse matrix computations

Evaluated performance of shared-virtual memory systems on PC-SMP clusters using six application kernels (structured and unstructured)

Architectures Examined: T3E, Origin2000, SP, PC Cluster, MTA

Examined scientific kernels on emerging microarchitectures: VIRAM (Berkeley PIM) and Imagine (Stanford Stream arch)

Programming Paradigms: MPI, OpenMP, hybrid, SHMEM, shared-memory, multithreading, vectorization, streaming

New Evaluation Project: Modern Parallel Vector Systems

Vector Architectures: SX6, X1, and ES

Plan to study key factors of modern parallel vector systems: runtime, scalability, programmability, portability, and memory overhead while identifying potential bottlenecks

Examine microbenchmarks, kernels, and application codes

What fraction of scientific codes is suitable for these architectures?

What is the best programming paradigm?

What algorithmic modifications are required?

What are the scalability-limiting factors?

What are the migration issues in terms of performance portability?

Microbenchmark and Kernel Codes

Examine memory bandwidth within a node for simple and complex array addressing.

Examine low-level message-passing characteristics: point-to-point, intra-node, extra-node, aggregate operations, and one-sided performance, as well as I/O

Task and thread performance: thread creation, task management, locks, semaphores, and barriers. Explicit threads vs. implicit OpenMP

Evaluate NAS Parallel Benchmarks using MPI, OpenMP, and hybrid programming. New class D and E size problems being developed by Rob van der Wijngaart at NASA Ames
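The "simple and complex array addressing" cases in the memory bandwidth item above can be illustrated with three access patterns: unit stride, constant stride, and indirect (gather) addressing. The sketch below is not one of the project's microbenchmarks; the function names, array size, and stride are assumptions chosen for illustration. Pure Python timings are dominated by interpreter overhead, so the numbers show how such a probe is structured rather than real hardware bandwidth.

```python
import random
import time
from array import array


def unit_stride_sum(a):
    """Simple addressing: read every element in storage order."""
    total = 0.0
    for i in range(len(a)):
        total += a[i]
    return total


def strided_sum(a, stride):
    """Constant-stride addressing: read every `stride`-th element."""
    total = 0.0
    for i in range(0, len(a), stride):
        total += a[i]
    return total


def gather_sum(a, index):
    """Complex (indirect) addressing: read a[index[k]], i.e. a gather."""
    total = 0.0
    for j in index:
        total += a[j]
    return total


def time_pattern(fn, *args, repeats=3):
    """Run fn(*args) a few times; return (result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


if __name__ == "__main__":
    n = 1 << 16                      # hypothetical problem size
    a = array("d", (float(i) for i in range(n)))
    perm = list(range(n))
    random.Random(0).shuffle(perm)   # random permutation for the gather case

    cases = [
        ("unit stride", unit_stride_sum, (a,), n),
        ("stride 8", strided_sum, (a, 8), n // 8),
        ("gather", gather_sum, (a, perm), n),
    ]
    for name, fn, args, touched in cases:
        _, seconds = time_pattern(fn, *args)
        print(f"{name:12s} {touched / seconds:12.0f} elements/s")
```

On a vector system the same three loops would be written in Fortran or C and timed per byte moved; the comparison of interest is how far the strided and gather rates fall below the unit-stride rate.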

Application Codes

Astrophysics:

MADCAP Microwave Anisotropy Dataset Computational Analysis Package. Analyses cosmic microwave background radiation datasets to extract the maximum likelihood angular power spectrum. Julian Borrill LBNL

CACTUS Direct evolution of Einstein's equations. Involves a coupled set of non-linear hyperbolic, elliptic equations with thousands of terms. John Shalf LBNL

Climate:

CCM3 Community Climate Model. Michael Wehner LBNL

Fluid Dynamics:

OverflowD Overset Navier-Stokes grid solver. Simulates complex rotorcraft vortex dynamics problems. Mohammad Djomehri NASA

Application Codes (cont)

Fusion

GTC Gyrokinetic Toroidal Code. 3D particle-in-cell code to study microturbulence in magnetic confinement fusion. Stephane Ethier Princeton Plasma Physics Laboratory

TLBE Thermal Lattice Boltzmann equation solver for modeling turbulence and collisions in plasma. Jonathan Carter LBNL
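Particle-in-cell codes generally deposit particle charge onto a grid through indirect addressing, and two particles in the same cell update the same grid entries, which is the kind of dependency a vectorizing compiler has to respect. The toy 1D sketch below shows that scatter step and one common workaround, private grid copies that are summed afterwards. It is an assumption-laden illustration, not GTC's actual deposition scheme; the function names, the linear weighting, the periodic grid, and the `lanes` parameter are all hypothetical.

```python
import math


def deposit_scalar(positions, charge, ng):
    """Linear-weighting charge deposition on a periodic 1D grid of ng cells.

    Each particle adds charge to its left and right grid points. Particles
    that share a cell write to the same entries of rho, so the loop carries
    a potential memory dependency between iterations.
    """
    rho = [0.0] * ng
    for x in positions:
        left = math.floor(x)
        w = x - left                  # fractional distance from left point
        i = int(left) % ng
        rho[i] += charge * (1.0 - w)
        rho[(i + 1) % ng] += charge * w
    return rho


def deposit_private_copies(positions, charge, ng, lanes=4):
    """Same deposition, but particle p writes into grid copy p % lanes.

    Within any block of `lanes` consecutive particles no two updates hit
    the same copy, so the block is free of write conflicts. The copies are
    reduced into a single grid at the end, at the cost of extra memory.
    """
    copies = [[0.0] * ng for _ in range(lanes)]
    for p, x in enumerate(positions):
        rho = copies[p % lanes]
        left = math.floor(x)
        w = x - left
        i = int(left) % ng
        rho[i] += charge * (1.0 - w)
        rho[(i + 1) % ng] += charge * w
    return [sum(c[i] for c in copies) for i in range(ng)]


if __name__ == "__main__":
    particles = [0.25, 0.75, 3.5]     # first two share cell 0
    print(deposit_scalar(particles, 1.0, 4))
    print(deposit_private_copies(particles, 1.0, 4))
```

The private-copy variant trades memory for independence between updates, which ties back to the memory overhead factor listed in the project goals.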

Material Science

PARATEC PARAllel Total Energy Code. Electronic structure code which performs ab-initio quantum-mechanical total energy calculations. Andrew Canning LBNL

Molecular Dynamics

NAMD Object-oriented molecular dynamics code designed for simulation of large biomolecular systems. David Skinner LBNL

Benchmarking Timeline and Evaluation Goals

Currently porting codes to single node SX6 (USA)

Will soon have multi-node SX6 access from DKRZ (Germany)

Early System Access to the Cray X1 expected in early February (ORNL)

Hope to gain Earth Simulator access summer 2003

Opportunity will allow us to compare performance and programmability with leading conventional architectures (Power4, Alpha EV67)

Allow comparison with significantly different X1 system:

X1 vector pipes are “distributed” within the X1 multistreaming processor

Cache-based architecture and support for globally addressable memory

Compiler must identify both streaming (microtasking) and vectorization, while maximizing cache reuse

Is the same programming style effective on both X1 and ES?

Help guide future system acquisition and scientific code development

Potential to run applications at unprecedented scale
