GACOP

JACCA Meeting - February 27, 2004

PAL

A New Approach in the System Software Design for Large-Scale Parallel Computers

Juan Fernández (1,2), Eitan Frachtenberg (1), Fabrizio Petrini (1), Salvador Coll (1) and José C. Sancho (1)

(1) Performance and Architecture Lab, CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov

(2) Grupo de Arquitectura y Computación Paralela (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es

email: {juanf,eitanf,fabrizio,scoll,jcsancho}@lanl.gov

Motivation

System software is a key factor in maximizing usability, performance and scalability on large-scale systems!!!

Hardware / OSs are glued together by System Software:
– Resource Management
– Communications
– Parallel Development and Debugging Tools
– Parallel File System
– Fault Tolerance

Motivation

System software complexity is due to multiple factors:
– Extremely complex global state
– Non-deterministic behavior inherent to computing systems and parallel apps
– Local OSs lack global awareness of parallel apps
– Independent design of different components
– User-level applications rely on system software

Outline

Motivation

Goals

Core Primitives

Resource Management

Communication Libraries

Ongoing and future work

Goals

Target
– Simplifying design and implementation of the system software for large-scale parallel computers
– Simplicity, performance, scalability, determinism

Approach
– Built atop a basic set of three primitives
– Global synchronization/scheduling

Vision
– SIMD system running MIMD applications (variable granularity in the order of hundreds of µs)

Core Primitives

System software built atop three primitives:

Xfer-And-Signal
– Transfer block of data to a set of nodes
– Optionally signal local/remote event upon completion

Compare-And-Write
– Compare global variable on a set of nodes
– Optionally write global variable on the same set of nodes

Test-Event
– Poll local event
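The three primitives can be illustrated with a small in-memory model. This is only a sketch: the `Node` and `Cluster` classes, the argument names, and the choice to perform the optional write only when the comparison holds on every node are assumptions made for illustration, not the interface of the actual system software.

```python
from dataclasses import dataclass, field
import operator


@dataclass
class Node:
    """Hypothetical node state: global variables, named buffers, pending events."""
    variables: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)
    events: set = field(default_factory=set)


class Cluster:
    """Toy, single-process stand-in for a set of nodes (assumption, not a real API)."""

    def __init__(self, n):
        self.nodes = [Node() for _ in range(n)]

    def xfer_and_signal(self, src, dests, key, data,
                        local_event=None, remote_event=None):
        """Transfer a block of data to a set of nodes, optionally signalling events."""
        for d in dests:
            self.nodes[d].memory[key] = data
            if remote_event is not None:
                self.nodes[d].events.add(remote_event)
        if local_event is not None:
            self.nodes[src].events.add(local_event)

    def compare_and_write(self, nodes, var, op, value, write=None):
        """Compare a global variable on a set of nodes; optionally write one.

        Assumption: the write happens only if the comparison is true on all nodes.
        """
        nodes = list(nodes)
        ok = all(op(self.nodes[n].variables.get(var, 0), value) for n in nodes)
        if ok and write is not None:
            wvar, wval = write
            for n in nodes:
                self.nodes[n].variables[wvar] = wval
        return ok

    def test_event(self, node, event):
        """Poll a local event; the event is consumed when it is found."""
        if event in self.nodes[node].events:
            self.nodes[node].events.discard(event)
            return True
        return False
```

In this model, a sender would call `xfer_and_signal` and later poll `test_event` for its local event, while a management process would use `compare_and_write` with `operator.eq` or `operator.ge` to query all nodes at once.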

Core Primitives

Characteristic   Requirement                      Solution
Job Launching    Data dissemination               Xfer-And-Signal
                 Flow Control                     Compare-And-Write
                 Termination Detection            Compare-And-Write
Job Scheduling   Heartbeat                        Xfer-And-Signal
                 Context switch responsiveness    Prioritized messages / Multiple Rails
Communication    PUT                              Xfer-And-Signal
                 GET                              Xfer-And-Signal
                 Barrier                          Compare-And-Write
                 Broadcast                        Compare-And-Write + Xfer-And-Signal
                 Reduce                           Xfer-And-Signal / “Smart” NIC

The proposed mechanisms simplify design and implementation!!!
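Two rows of the table, Barrier and Broadcast, can be sketched on top of simplified stand-ins for the primitives. Everything here is an assumption for illustration: nodes are plain dictionaries, and the variable names `arrived`, `released` and `buffer_free` are hypothetical.

```python
import operator


def compare_and_write(nodes, var, op, value, write=None):
    """Simplified stand-in: true only if the comparison holds on every node."""
    ok = all(op(n.get(var, 0), value) for n in nodes)
    if ok and write is not None:
        for n in nodes:
            n[write[0]] = write[1]
    return ok


def xfer_and_signal(dests, key, data, event):
    """Simplified stand-in: copy data to every destination and signal an event."""
    for n in dests:
        n[key] = data
        n.setdefault("events", set()).add(event)


def barrier_poll(nodes, epoch):
    """Barrier via Compare-And-Write: release all nodes once all have arrived."""
    return compare_and_write(nodes, "arrived", operator.ge, epoch,
                             write=("released", epoch))


def broadcast(nodes, data):
    """Broadcast via Compare-And-Write + Xfer-And-Signal.

    The global query acts as flow control: data is sent only when every
    receiver reports a free buffer, and the buffers are then marked busy.
    """
    if not compare_and_write(nodes, "buffer_free", operator.eq, 1,
                             write=("buffer_free", 0)):
        return False
    xfer_and_signal(nodes, "buffer", data, "bcast_done")
    return True
```

A caller would repeat `barrier_poll` until it returns `True`. A second `broadcast` fails until the receivers set `buffer_free` back to 1.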

Core Primitives

Implementation
– Global, virtually addressable shared memory
– Remote Direct Memory Access (RDMA)
– Hardware-supported multicast
– Hardware-supported global query
– Computing capability in the NIC

Portability
– InfiniBand, BlueGene/L, QsNet

Resource Management

STORM: Scalable TOol for Resource Management [1,2]

Job launching
– Binary and data dissemination
– Actual launching of a parallel job
– Reporting of job termination

Job scheduling
– FCFS, gang scheduling, ... [3]
– New scheduling algorithms can be “plugged”

Heartbeat/strobe at regular intervals (time slices)

Monitoring

Built atop the three core primitives
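The idea of pluggable scheduling algorithms driven by a heartbeat can be sketched as follows. This is not STORM code: the function names, the policy signature (a policy receives the queue of runnable jobs and returns the job for the next time slice) and the job model (a job needs a fixed number of time slices) are all assumptions for illustration.

```python
from collections import deque


def fcfs(queue):
    """First come, first served: keep running the job at the head of the queue."""
    return queue[0]


def gang_round_robin(queue):
    """Gang scheduling, round robin: run the head job, then move it to the back."""
    job = queue[0]
    queue.rotate(-1)
    return job


def run(jobs, policy, max_slices=1000):
    """Simulate heartbeats. `jobs` maps a job name to the time slices it needs.

    Returns the job chosen for each time slice. In a real system, each
    heartbeat would tell every node to context-switch to the chosen job.
    """
    remaining = dict(jobs)
    queue = deque(jobs)          # arrival order
    schedule = []
    for _ in range(max_slices):
        if not queue:
            break
        job = policy(queue)
        schedule.append(job)
        remaining[job] -= 1
        if remaining[job] == 0:
            queue.remove(job)    # job termination is reported and it leaves the queue
    return schedule
```

With `jobs = {"A": 2, "B": 1}`, `run(jobs, fcfs)` yields `["A", "A", "B"]` and `run(jobs, gang_round_robin)` yields `["A", "B", "A"]`; a new policy is added by writing another function with the same signature.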

[1] “Scalable Resource Management in High Performance Computers.” E. Frachtenberg, J. Fernández, F. Petrini, and S. Coll. Cluster'02.

[2] “STORM: Lightning-Fast Resource Management.” E. Frachtenberg, J. Fernández, F. Petrini, S. Pakin and S. Coll. SC'02.

[3] “Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources.” E. Frachtenberg, D. G. Feitelson, F. Petrini and J. Fernández. IPDPS'03.

Communication Libraries

BCS-MPI: Buffered Coscheduled MPI [4]

Global synchronization [5]
– Heartbeat/strobe sent at regular intervals (time slices)
– All system activities are tightly coupled

Global scheduling
– Exchange of communication requirements
– Communication scheduling
– Perform real transmission and reduce computations [6]

Implementation on the NIC (Elan3 - QsNet)

Built atop the three core primitives
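The buffered, coscheduled style of communication can be illustrated with a toy model in which posted operations are only recorded, and all matching and data movement happens at the next strobe. This is a sketch under stated assumptions, not BCS-MPI itself: the `BcsSim` class, its method names and the matching rule (exact source, destination and tag, in posting order) are hypothetical.

```python
from collections import defaultdict


class BcsSim:
    """Toy model: posts are buffered during a time slice and served at the strobe."""

    def __init__(self):
        self.pending_sends = []              # (src, dst, tag, payload)
        self.pending_recvs = []              # (dst, src, tag)
        self.delivered = defaultdict(list)   # dst -> [(src, tag, payload)]

    def isend(self, src, dst, tag, payload):
        """Post a send descriptor; nothing is transmitted yet."""
        self.pending_sends.append((src, dst, tag, payload))

    def irecv(self, dst, src, tag):
        """Post a receive descriptor; nothing is transmitted yet."""
        self.pending_recvs.append((dst, src, tag))

    def strobe(self):
        """Start of a time slice: exchange requirements, schedule, transmit."""
        # Exchange of communication requirements: in this single-process
        # model every descriptor is already globally visible.
        recvs = list(self.pending_recvs)
        scheduled, unmatched = [], []
        # Communication scheduling: match each send with a posted receive.
        for send in self.pending_sends:
            src, dst, tag, _ = send
            key = (dst, src, tag)
            if key in recvs:
                recvs.remove(key)
                scheduled.append(send)
            else:
                unmatched.append(send)       # stays buffered for a later slice
        # Real transmission of the scheduled messages.
        for src, dst, tag, payload in scheduled:
            self.delivered[dst].append((src, tag, payload))
        self.pending_sends, self.pending_recvs = unmatched, recvs
        return len(scheduled)
```

A send posted without a matching receive simply waits: the first `strobe()` transmits nothing, and the message is delivered at the strobe that follows the matching `irecv`.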

[4] “BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers.” J. Fernández, E. Frachtenberg, and F. Petrini. SC'03.

[5] “Scalable Collective Communication on the ASCI Q Machine.” J. Fernández, E. Frachtenberg, and F. Petrini. HOTi'11.

[6] “Scalable NIC-based Reduction on Large-scale Clusters.” A. Moody, J. Fernández, F. Petrini and D. K. Panda. SC'03.

Communication Libraries

BCS-MPI: real-time communication scheduling

[Figure: timeline of one time slice (hundreds of µs). Labels: Global Strobe (time slice starts), Global Synchronization (twice), Exchange of comm requirements, Communication scheduling, Real transmission, Global Strobe (time slice ends).]

Ongoing and future work

Improved system utilization
– Scheduling multiple jobs

QoS for different types of traffic
– Scheduling messages may provide traffic segregation

Transparent fault tolerance [7]
– BCS-MPI simplifies the state of the machine

Kernel-level implementation of BCS-MPI
– User-level solution is already working

Deterministic replay of MPI programs
– Ordered resource scheduling may enforce reproducibility

[7] “On the Feasibility of Incremental Checkpointing for Scientific Computing.” J. C. Sancho, F. Petrini, G. Johnson, J. Fernández and E. Frachtenberg. IPDPS'04.

Motivation

Characteristic    Workstation                           Cluster
Job Launching     Operating System                      Scripts/Middleware on top of the OS
Job Scheduling    Timeshared by OS                      Batch queued or gang scheduled (with large quanta) using middleware
Communication     IPC/Shared Memory                     Message Passing Library (e.g. MPI) / Data-Parallel Programming (e.g. HPF)
Fault Tolerance   Little or none                        Application/application-assisted checkpointing
Storage           Standard file system                  Custom parallel file system
Debuggability     Standard tools: Reproducibility!!!    Parallel debugging tools: Non-determinism!!!

Growing gap between workstation and cluster usability!!!

Motivation

System software complexity is due to multiple factors:

Extremely complex global state
– Thousands of processes, threads, open files, pending messages, etc.

Non-deterministic behavior
– Inherent to computing systems: OS process scheduling
– Induced by parallel applications: MPI_ANY_SOURCE

Local OSs lack global awareness of parallel applications
– Interference with fine-grain synchronization operations: non-scalable collective communication primitives

Independent design of different components
– Redundancy of functionality: communication protocols
– Missing functionality: QoS for user-level traffic / system-level traffic

User-level applications rely on system software
– System software performance/scalability impacts user-application performance/scalability

Resource Management

Job Launching

STORM is 40 times faster than the best reported result!!!

Resource Management

Job Scheduling

STORM is able to use very small time slices: RESPONSIVENESS!!!

Communication Libraries

Non-blocking primitives: MPI_Isend/Irecv

Communication Libraries

Blocking primitives: MPI_Send/Recv

Communication Libraries

Global Synchronization Protocol

Global Message Scheduling Phase
– Microphases: Descriptor Exchange + Message Scheduling

Message Transmission Phase
– Microphases: Point-to-point, Barrier and Broadcast, Reduce
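The phase and microphase structure above can be written down as a small data structure. In this sketch the names `Microphase`, `PHASES`, `CARRIED_IN` and `plan_transmission` are hypothetical, the mapping from operation kind to microphase is an assumption, and the microphases are assumed to run in the order listed on the slide.

```python
from enum import Enum


class Microphase(Enum):
    DESCRIPTOR_EXCHANGE = 1
    MESSAGE_SCHEDULING = 2
    POINT_TO_POINT = 3
    BARRIER_AND_BROADCAST = 4
    REDUCE = 5


# The two phases of a time slice and their microphases, as listed on the slide.
PHASES = {
    "Global Message Scheduling": [
        Microphase.DESCRIPTOR_EXCHANGE,
        Microphase.MESSAGE_SCHEDULING,
    ],
    "Message Transmission": [
        Microphase.POINT_TO_POINT,
        Microphase.BARRIER_AND_BROADCAST,
        Microphase.REDUCE,
    ],
}

# Hypothetical mapping from operation kind to the microphase that carries it.
CARRIED_IN = {
    "send": Microphase.POINT_TO_POINT,
    "barrier": Microphase.BARRIER_AND_BROADCAST,
    "bcast": Microphase.BARRIER_AND_BROADCAST,
    "reduce": Microphase.REDUCE,
}


def plan_transmission(ops):
    """Group scheduled operations by transmission microphase, in listed order."""
    plan = []
    for phase in PHASES["Message Transmission"]:
        batch = [op for op in ops if CARRIED_IN[op] is phase]
        if batch:
            plan.append((phase, batch))
    return plan
```

Whatever order the operations were posted in, `plan_transmission` returns them grouped so that point-to-point traffic comes first, then barriers and broadcasts, then reductions.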

Communication Libraries

SAGE - timing.input (IA32)

0.5% SPEEDUP!!!