
FT-MPICH: Providing Fault Tolerance for MPI Parallel Applications

Prof. Heon Y. Yeom
Distributed Computing Systems Lab., Seoul National University
Condor Week 2006
yeom@snu.ac.kr

Motivation

Condor supports a Checkpoint/Restart (C/R) mechanism only in the Standard Universe, i.e., for single-process jobs.

C/R for parallel jobs is not provided in any of the current Condor universes.

We would like to make C/R available for MPI programs.


Introduction

Why the Message Passing Interface (MPI)?

Designing a generic fault-tolerance framework is extremely hard due to the diversity of hardware and software systems.

MPI is the most popular programming model in cluster computing.

Providing fault tolerance at the MPI level is more cost-effective than providing it in the OS or hardware.

We have chosen the MPICH series ...


Architecture: Concept

[Diagram] FT-MPICH combines three elements: monitoring, failure detection, and a C/R protocol.


Architecture: Overall System

[Diagram] The Management System and each MPI process contain their own communication module; components interact locally through IPC and across nodes over Ethernet, with a message queue buffering messages.


Management System

The Management System makes MPI more reliable. Its responsibilities are:

Failure detection
Checkpoint coordination
Recovery
Initialization coordination
Output management
Checkpoint transfer


Manager System

[Diagram] A Leader Manager coordinates one Local Manager per node, each attached to an MPI process, and has access to stable storage. The Leader Manager handles initialization, checkpoint commands, checkpoint transfer, and failure notification & recovery; the MPI processes communicate directly with one another to exchange data. A sketch of this control traffic follows below.
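A minimal sketch, in C, of the kinds of control messages the diagram suggests flow between the Leader Manager and the Local Managers; the names are assumptions for illustration only, not the actual FT-MPICH protocol.

/* Hypothetical control-message tags between the Leader and Local Managers. */
enum mgr_msg {
    MSG_INIT_INFO,        /* Local Manager -> Leader: process (IP, port) at startup */
    MSG_CKPT_CMD,         /* Leader -> Local Manager: take a checkpoint             */
    MSG_CKPT_TRANSFER,    /* Local Manager -> stable storage: checkpoint image      */
    MSG_FAILURE_NOTIFY,   /* Leader -> survivors: a process has failed              */
    MSG_RECOVERY_CMD      /* Leader -> Local Manager: restart from checkpoint image */
};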


Fault-tolerant MPICH_P4

[Diagram] FT-MPICH is layered on MPICH's ADI (Abstract Device Interface) and the ch_p4 (Ethernet) device. An FT module provides atomic message transfer for point-to-point and collective operations together with a checkpoint toolkit; a recovery module handles connection re-establishment.


Startup in Condor

Preconditions:

The Leader Manager already knows, from user input, the machines where the MPI processes will run and the number of MPI processes.

The binaries of the Local Manager and the MPI process are located at the same path on each machine.


Startup in Condor

Job submission description file: the Vanilla Universe is used, and a shell script is named in the submit description file; executable points to the shell script, which only executes the Leader Manager.

Example.cmd (submit description file):

universe   = Vanilla
executable = exe.sh
output     = exe.out
error      = exe.err
log        = exe.log
queue

exe.sh (shell script):

#!/bin/sh
Leader_manager …


Startup in Condor

Normal job startup: the user submits the job using condor_submit.

[Diagram] In the Condor pool, the schedd (and shadow) on the submit machine and the negotiator and collector on the central manager match the job to an execute machine, where the startd and starter launch the job, i.e., the Leader Manager.


Startup in Condor

The Leader Manager executes the Local Managers, and each Local Manager executes its MPI process with fork() & exec(); a minimal sketch follows below.

[Diagram] The Leader Manager (the Condor job on the execute machine) starts a Local Manager on each of execute machines 1-3, and each Local Manager forks and execs one MPI process.
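A minimal sketch of the fork()/exec() step shown above, assuming a hypothetical MPI process binary ./mpi_app; this is an illustration, not the FT-MPICH launcher code.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: becomes the MPI process.  The path is hypothetical. */
        execl("./mpi_app", "mpi_app", (char *)NULL);
        perror("execl");              /* reached only if exec fails */
        _exit(127);
    }

    /* Parent (the Local Manager): wait for / monitor the MPI process. */
    int status;
    waitpid(pid, &status, 0);
    printf("MPI process exited with status %d\n", WEXITSTATUS(status));
    return 0;
}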


Startup in Condor

Each MPI process sends its communication info, the Leader Manager aggregates this info, and the Leader Manager then broadcasts the aggregated info to all processes (a sketch of the record being exchanged follows below).

[Diagram] The MPI processes on execute machines 1-3 report their communication info through their Local Managers to the Leader Manager, which broadcasts the aggregated table back to every MPI process.
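A sketch of the per-process connection record such an exchange could use; the layout and names are assumptions for illustration only, not the FT-MPICH wire format.

#include <stdint.h>

#define MAX_PROCS 64            /* illustrative upper bound on ranks */

/* One entry per MPI rank: where its peers can reach it. */
typedef struct {
    int32_t  rank;              /* MPI rank                            */
    uint32_t ip;                /* listening IP address, network order */
    uint16_t port;              /* listening TCP port, network order   */
} conn_info_t;

/* The Leader Manager fills one slot per rank as reports arrive and then
 * broadcasts the whole table so every process can connect to its peers. */
conn_info_t conn_table[MAX_PROCS];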


Fault Tolerant MPI

To provide MPI fault tolerance, we have adopted:

A coordinated checkpointing scheme (vs. an independent scheme); the Leader Manager is the coordinator.

Application-level checkpointing (vs. kernel-level checkpointing); this method does not require any effort on the part of cluster administrators.

A user-transparent checkpointing scheme (vs. user-aware); this method requires no modification of MPI source code.


Atomic Message Passing

Coordination between MPI processes. Assumption: the communication channel is FIFO. Lock() and Unlock() are used to create an atomic region; a sketch of one way to realize them follows below.

[Diagram] Proc 0 and Proc 1 each bracket a message exchange between Lock() and Unlock(). A CKPT SIG that arrives outside the atomic region is acted on immediately (checkpoint is performed), while one that arrives inside the region is delayed until Unlock() (checkpoint is delayed).
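A minimal sketch of Lock()/Unlock() using signal blocking, assuming the checkpoint signal is SIGUSR1 (the actual signal used by FT-MPICH is not stated here); the point is only that a checkpoint request arriving inside the atomic region is deferred until Unlock().

#include <signal.h>

static sigset_t ckpt_set;

void atomic_init(void)
{
    sigemptyset(&ckpt_set);
    sigaddset(&ckpt_set, SIGUSR1);    /* assumed CKPT SIG */
}

void Lock(void)                       /* enter the atomic region */
{
    sigprocmask(SIG_BLOCK, &ckpt_set, NULL);
}

void Unlock(void)                     /* leave the atomic region */
{
    /* A CKPT SIG that arrived while blocked is delivered here,
     * i.e., the delayed checkpoint is performed now. */
    sigprocmask(SIG_UNBLOCK, &ckpt_set, NULL);
}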


Atomic Message Passing(Case 1)

When an MPI process receives the CKPT SIG, it sends and receives barrier messages before checkpointing.

[Diagram] Both Proc 0 and Proc 1 receive the CKPT SIG outside their atomic regions, exchange barrier messages (together with any pending data), and then take their checkpoints.


Atomic Message Passing (Case 2)

By sending and receiving barrier messages, in-transit messages are pushed to their destination.

[Diagram] One process receives the CKPT SIG while its peer is inside an atomic region; because the channel is FIFO, the barrier arrives after the in-flight data, so the checkpoint is delayed until that data has been delivered.


Atomic Message Passing (Case 3)

The communication channel between MPI processes is flushed, so dependencies between MPI processes are removed; a sketch of the flush step follows below.

[Diagram] Once each process has received the barrier message from every peer, no message remains in transit, and the delayed checkpoints can proceed.
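A sketch of the channel-flush step under the FIFO-channel assumption from these slides; the one-byte framing and the function names are assumptions for illustration, not the FT-MPICH implementation.

#include <unistd.h>

enum { TAG_DATA = 0, TAG_BARRIER = 1 };

/* Drain one peer's channel until its barrier marker arrives.  Because
 * the channel is FIFO, everything read before the barrier is exactly
 * the in-transit data that must be delivered before checkpointing. */
static void flush_channel(int sock)
{
    char tag;
    for (;;) {
        if (read(sock, &tag, 1) != 1)
            break;                    /* peer closed or failed */
        if (tag == TAG_BARRIER)
            break;                    /* channel is now empty  */
        /* tag == TAG_DATA: read and buffer the payload here ... */
    }
}

/* On CKPT SIG: send a barrier to every peer, then drain every channel.
 * When all barriers are in, no message is in flight and the checkpoint
 * can be taken. */
void flush_all_channels(const int *socks, int npeers)
{
    char barrier = TAG_BARRIER;
    for (int i = 0; i < npeers; i++)
        write(socks[i], &barrier, 1);
    for (int i = 0; i < npeers; i++)
        flush_channel(socks[i]);
}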


Checkpointing

Coordinated checkpointing; a coordinator-side sketch follows below.

[Diagram] The Leader Manager issues the checkpoint command to ranks 0-3. Each process writes an application-level image of its address space (stack, data, text, heap) to storage, and successive checkpoints are kept as versions (ver 1, ver 2).
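A hedged sketch of the coordination step: the Leader Manager could advance the checkpoint version only after every Local Manager reports success, so the previous complete version stays usable. The one-byte command/ack framing is an assumption for illustration, not the FT-MPICH protocol.

#include <unistd.h>

enum { CMD_CHECKPOINT = 1, ACK_DONE = 2 };

/* Broadcast the checkpoint command to every Local Manager, then wait
 * for all acknowledgements.  Returns the new version number on success,
 * or the old one if any manager failed to answer, so an incomplete
 * checkpoint never replaces a complete one (e.g., ver 1 -> ver 2). */
int coordinate_checkpoint(const int *mgr_socks, int nmgrs, int version)
{
    char cmd = CMD_CHECKPOINT, ack;

    for (int i = 0; i < nmgrs; i++)
        write(mgr_socks[i], &cmd, 1);

    for (int i = 0; i < nmgrs; i++)
        if (read(mgr_socks[i], &ack, 1) != 1 || ack != ACK_DONE)
            return version;           /* keep the last complete version */

    return version + 1;
}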


Failure Recovery

MPI process recovery.

[Diagram] A new process is created and the saved checkpoint image (stack, data, text, heap) is loaded into it, yielding the restarted process.


Failure Recovery

Connection re-establishment: each MPI process re-opens its socket and sends its IP and port info to the Local Manager; a sketch follows below. This is the same procedure used at initialization time.
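A minimal sketch of the re-opening step: bind a fresh listening socket on an ephemeral port and recover the port number to report to the Local Manager. The function name and the reporting step are assumptions for illustration.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Re-open a listening socket on any free port; on success, return the
 * socket and write the chosen port (host order) to *port_out so the
 * caller can send its (IP, port) pair to the Local Manager. */
int reopen_listener(unsigned short *port_out)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;                        /* kernel picks the port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }

    *port_out = ntohs(addr.sin_port);
    return fd;
}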


Fault Tolerant MPI

Recovery from failure.

[Diagram] On failure detection, the Leader Manager uses the checkpoint images stored for ranks 0-3 (the last complete version, e.g. ver 1) to recover the computation.


Fault Tolerant MPI in Condor

The Leader Manager controls the MPI processes by issuing checkpoint commands and by monitoring them.

[Diagram] From Condor's point of view, the job is the Leader Manager on the execute machine; the Local Managers and MPI processes on execute machines 1-3 run underneath it, so Condor is not aware of the failure incident.


Fault-Tolerant MPICH Variants (Seoul National University)

[Diagram] The same FT module (atomic message transfer for P2P and collective operations, plus a checkpoint toolkit) and recovery module (connection re-establishment) sit on top of MPICH's ADI (Abstract Device Interface) in three variants: MPICH-GF on Globus2 (Ethernet), M3 on GM (Myrinet), and SHIELD on MVAPICH (InfiniBand).


Summary

We can provide fault tolerance for parallel applications using MPICH on Ethernet, Myrinet, and InfiniBand.

Currently, only the P4 (Ethernet) version works with Condor.

We look forward to working with the Condor team.
