Upload
charleen-patrick
View
213
Download
0
Embed Size (px)
Citation preview
Evaluation of Modern Parallel Vector Evaluation of Modern Parallel Vector ArchitecturesArchitectures
Leonid OlikerFuture Technologies Group
Computational Research Division
LBNLwww.nersc.gov/~oliker
Previous ResearchPrevious Research
Examined complex interactions between high-level algorithms, leading programming paradigms, and modern architectural platforms
Evaluated three parallelization strategies of a dynamic unstructured mesh adaptation algorithm
Examined two major classes of adaptive applications under three parallel programming model (UMA and N-Body)
Investigated effects of algorithmic orderings on sparse matrix computations
Evaluated performance of shared-virtual memory systems on PC-SMP clusters using six application kernels (structured and unstructured)
Architectures Examined: T3E, Origin2000, SP, PC Cluster, MTA
Examined scientific kernels on emerging microarchitectures: VIRAM (Berkeley PIM) and Imagine (Stanford Stream arch)
Programming Paradigms: MPI, OpenMP, hybrid, SHMEM, shared-memory, multithreading, vectorization, streaming
New Evaluation Project:New Evaluation Project:Modern Parallel Vector SystemsModern Parallel Vector Systems
Vector Architectures: SX6, X1, and ES
Plan to study key factors of modern parallel vector systems: runtime, scalability, programmability, portability, and memory overhead while identifying potential bottlenecks
Examine microbenchmarks, kernels, and application codes
What fraction of scientific codes suitable for these arch?What best programming paradigm?What required algorithmic modifications?What are scalability limiting factors?What migration issues in terms of performance portability?
Microbenchmark and Kernel CodesMicrobenchmark and Kernel Codes
Examine memory bandwidth within a node for simple and complex array addressing.
Examine low level message-passing characteristics:point-to-point, intra-node, extra-node, aggregate operations, and one-sided performance, as well as I/O
Task and thread performance: thread creation, task management locks, semaphores, and barriers. Explicit threads vs. implicit OpenMP
Evaluate NAS Parallel Benchmarks using MPI, OpenMP, and Hybrid programming. New class D and E size problems being developed by Rob Wijngaar at NASA Ames
Application CodesApplication Codes
Astrophysics:
MADCAP Microwave Anisotropy Dataset Computational Analysis Package. Analyses cosmic microwave background radiation datasets to extract the maximum likelihood angular power spectrum. Julian Borrill LBNL
CACTUS Direct evolution of Einstein's equations. Involves a coupled set of non-linear hyperbolic, elliptic equations with thousands of terms. John Shalf LBNL
Climate:
CCM3 Community Climate Model Michael Wehner LBNL Fluid Dynamics
OverflowD Overset Navier-Stokes grid solver. Simulates complex rotorcraft vortex dynamics problems. Mohammad Djomehri NASA
Application Codes (cont)Application Codes (cont)
Fusion
GTC Gyrokinetic Toroidal Code. 3D particle-in-cell code to study microturbulence in magnetic confinement fusion. Stephane Ethier Princeton Plasma Physics Laboratory
TLBE Thermal Lattice Boltzmann equation solver for modeling turbulence and collisions in plasma. Jonathan Carter LBNL
Material Science
PARATEC PARAllel Total Energy Code. Electronic structure code which performs ab-initio quantum-mechanical total energy calculations. Andrew Canning LBNL
Molecular Dynamics
NAMD Object-oriented molecular dynamics code designed for simulation of large biomolecular systems. David Skinner LBNL
Benchmarking TimelineBenchmarking Timelineand Evaluation Goalsand Evaluation Goals
Currently porting codes to single node SX6 (USA)
Will soon have multi-node SX6 access from DKRZ (Germany)
Early System Access to the Cray X1 expected in early February (ORNL)
Hope to gain Earth Simulator access summer 2003
Opportunity will allow us to compare performance and programmability with leading conventional architectures (Power4, Alpha EV67)
Allow comparison with significantly different X1 system: X1 vector pipes are “distributed” within the X1 multistreaming processor Cache based architecture and support for globally addressable memory Compiler must identify both streaming (microtasking) and vectorization, while
maximizing cache reuse Is the same programming style effective on both X1 and ES
Help guide future system acquisition and scientific code development
Potential to run applications at unprecedented scale