
Page 1: Parallel I/O for Distributed Applications

Programming Environment & Training (PET)
14 - 15 Sep 99

Dr Graham E Fagg
Innovative Computing Laboratory
University of Tennessee
Knoxville, TN 37996-1301
Page 2: Overview

• Aims of Project
• Background on MPI_Connect (PVMPI)
• Year 4 work
  – Changes to MPI_Connect
    • PVM
    • SNIPE and RCDS
    • Single-threaded comms
    • Multi-threaded comms

Page 3: Overview (cont)

• Parallel IO
  – Subsystems
    • ROMIO (MPICH)
    • High performance platforms
• File handling and management
  – Pre-caching
  – IBP and SNIPE-Lite
  – Experimental system

Page 4: Overview (cont)

• Future Work
  – File handling and exemplars

• DoD benefits

• Changes to milestones

• Additional Comments

• Conclusions

Page 5: Aims of Project

• Continue development of MPI_Connect
  – Fix bugs
    • Like async non-blocking message problems
  – Enhance features as requested by users
  – Support Parallel IO (as in MPI-2 Parallel IO)
  – Support complex file management across systems and sites
    • As we already support the computation, why not the input and result files as well?

Page 6: Aims of Project (cont)

• Support better scheduling of application runs across sites and systems
  – i.e. gang scheduling of both processors and pre-fetching of data (logistical scheduling)
• Support CFD and CWO CTAs
• Training and outreach
• HPC challenges (SC98, SC99, SC200..)

Page 7: Background on MPI_Connect

• What is MPI_Connect?
  – A system that allows two or more high-performance MPI applications to inter-operate across systems/sites.
  – Allows each application to use the tuned vendor-supplied MPI implementation without forcing the loss of local performance that occurs with systems like MPICH (p2) and Global MPI (Nexus MPI).

Page 8: MPI_Connect

Coupled model example (diagram): an Ocean Model MPI application and an Air Model MPI application, each with its own MPI_COMM_WORLD, are joined by a global inter-communicator (handles air_comm and ocean_comm on the two sides).

Page 9: MPI_Connect

• The application developer adds just three extra calls to an application to allow it to inter-operate with any other application (see the sketch below):
  – MPI_Conn_register, MPI_Conn_intercomm_create, MPI_Conn_remove
• Once the above calls are added, normal MPI point-to-point calls can be used to send messages between systems.
  – The only requirements are that both applications can access a common name service (usually via IP) and that the MPI implementation has a profiling layer available.
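A minimal sketch of how a coupled code might use these calls. The slides do not give the MPI_Conn_* prototypes, so the argument lists below are assumptions for illustration only; the names "ocean" and "air" follow the coupled-model example on the previous page.

  /* Hypothetical sketch: MPI_Conn_* argument lists are assumed, not taken
   * from the real MPI_Connect header. */
  #include <mpi.h>

  /* Assumed prototypes for the three MPI_Connect calls named above. */
  extern int MPI_Conn_register(const char *name, MPI_Comm local_comm);
  extern int MPI_Conn_intercomm_create(const char *remote_name,
                                       MPI_Comm local_comm, MPI_Comm *inter);
  extern int MPI_Conn_remove(const char *name);

  int main(int argc, char **argv)
  {
      MPI_Comm air_comm;           /* inter-communicator to the "air" application */
      double sst[1024] = {0};      /* sea-surface temperatures to exchange */
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Register this MPI_COMM_WORLD with the external name service as "ocean". */
      MPI_Conn_register("ocean", MPI_COMM_WORLD);

      /* Build a global inter-communicator to the application registered as "air". */
      MPI_Conn_intercomm_create("air", MPI_COMM_WORLD, &air_comm);

      /* From here on, ordinary MPI point-to-point calls cross the systems:
         local rank 0 sends to rank 0 of the air model. */
      if (rank == 0)
          MPI_Send(sst, 1024, MPI_DOUBLE, 0, 99, air_comm);

      /* Withdraw from the name service before shutting down. */
      MPI_Conn_remove("ocean");

      MPI_Finalize();
      return 0;
  }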

Page 10: MPI_Connect

(Diagram) Path of an MPI call through the intercomm library: the user's code calls an MPI_ function, the library looks up the communicator, and if it is a true MPI intra-communicator the profiled PMPI_ call is used; otherwise the call is translated into SNIPE/PVM addressing and the SNIPE/PVM functions of the other library are used. In either case the library works out the correct return code to hand back to the user's code, as sketched below.
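A minimal sketch of the interception idea the diagram describes. This is not MPI_Connect's actual source; is_external_comm() and snipe_send() are hypothetical helpers standing in for the communicator lookup and the SNIPE/PVM transport.

  #include <mpi.h>

  extern int is_external_comm(MPI_Comm comm, int *remote_id);          /* hypothetical */
  extern int snipe_send(const void *buf, int count, MPI_Datatype type,
                        int remote_id, int dest, int tag);             /* hypothetical */

  int MPI_Send(const void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
  {
      int remote_id;

      if (!is_external_comm(comm, &remote_id)) {
          /* True MPI intra-communicator: fall through to the vendor's tuned
             implementation via the profiling interface, so no local
             performance is lost. */
          return PMPI_Send(buf, count, type, dest, tag, comm);
      }

      /* Inter-system communicator: translate into SNIPE/PVM addressing, send
         over the external transport, and work out a matching MPI return code. */
      return snipe_send(buf, count, type, remote_id, dest, tag) == 0
                 ? MPI_SUCCESS : MPI_ERR_OTHER;
  }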

Page 11: Year 4 work – Changes to MPI_Connect

• PVM
  – Worked well for the SC98 High Performance Computing Challenge demo.
• BUT
  – Not everything worked as well as it should:
    • PVM got in the way of the IBM POE job control system.
    • Async messages were not truly non-blocking/asynchronous
      – As discovered by the SPLICE team.

Page 12: MPI_Connect and SNIPE

• SNIPE (Scalable Networked Information Processing Environment)
  – Was seen as a replacement for PVM:
    • No central point of failure
    • High-speed, reliable communications with limited QoS
    • Powerful, secure metadata service based on RCDS

Page 13: MPI_Connect and SNIPE

• SNIPE used RCDS for its name service
  – This worked on the SGI/Cray Origin and IBM SP systems but did not, and still does not, work on the Cray T3E (jim).
• Solution
  – Kept the communications (SNIPE_Lite) and dropped RCDS for a custom name service daemon (more on this later).

Page 14: MPI_Connect and single-threaded communications

• The SNIPE_Lite communications library was single-threaded by default
  – Note: single-threaded Nexus is also called NexusLite.
• This meant that asynchronous non-blocking calls became merely non-blocking, and no progress could be made outside of an MPI call
  – (just as in the PVM case when using direct IP sockets)

Page 15: Single-threaded communications

(Diagram) A message is sent with MPI_Isend() between different systems.

Each time an MPI call is made, the sender can check the outgoing socket and force some more data through it. The socket should be marked non-blocking so that the MPI application cannot be deadlocked by the actions of an external system; i.e. the system does not make progress on its own between MPI calls (sketched below).

When the user's application calls MPI_Wait(), the communication is forced through to completion.
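An illustrative sketch of this progress-pushing scheme, not the SNIPE_Lite source: each time the library is entered it tries to push more of a pending outgoing message through a socket that has been set non-blocking, so a slow or stalled external system can never deadlock the MPI application. The pending_send structure and try_progress() are names invented for the sketch.

  #include <errno.h>
  #include <sys/types.h>
  #include <sys/socket.h>

  struct pending_send {
      int         fd;     /* socket to the external system, O_NONBLOCK already set */
      const char *buf;    /* remaining bytes of the message */
      size_t      left;   /* bytes still to send */
  };

  /* Called from inside every intercepted MPI call, and repeatedly from
     MPI_Wait(); returns 1 once the whole message has gone. */
  static int try_progress(struct pending_send *p)
  {
      while (p->left > 0) {
          ssize_t n = send(p->fd, p->buf, p->left, 0);
          if (n > 0) {                                   /* pushed some more data */
              p->buf  += n;
              p->left -= (size_t)n;
          } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
              return 0;        /* socket full: give up until the next MPI call */
          } else {
              return -1;       /* real socket error */
          }
      }
      return 1;                /* message completely sent */
  }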

Page 16: Multi-threaded Communications

• The solution was to use multi-threaded communications for external communications.
  – 3 threads in the initial implementation:
    • 1 send thread
    • 1 receive thread
    • 1 control thread that handles name service requests and sets up connections to external applications

Page 17: Multi-threaded Communications

• How it works
  – Sends put message descriptions onto a send queue
  – Receive operations put requests onto a receive queue
  – If the operation is blocking, the caller is suspended until a condition arises that would wake it up (using condition variables)
• While the main thread continues after 'posting' a non-blocking operation, the threading library steals cycles to send/receive the message (see the sketch below).
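A minimal sketch of this queue-plus-worker pattern, assuming POSIX threads; the names are hypothetical and the real send/receive threads do much more (connection setup, receive matching, name service traffic).

  #include <pthread.h>
  #include <stddef.h>

  struct send_req {
      const void      *buf;
      size_t           len;
      int              done;          /* set by the send thread when finished */
      struct send_req *next;
  };

  static struct send_req *queue_head = NULL;
  static pthread_mutex_t  lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t   cond = PTHREAD_COND_INITIALIZER;

  /* Posting side: corresponds to a non-blocking send on an external communicator. */
  void post_send(struct send_req *r)
  {
      pthread_mutex_lock(&lock);
      r->done = 0;
      r->next = queue_head;
      queue_head = r;
      pthread_cond_broadcast(&cond);   /* wake the send thread */
      pthread_mutex_unlock(&lock);
  }

  /* Blocking wait: corresponds to MPI_Wait (or a blocking send). */
  void wait_send(struct send_req *r)
  {
      pthread_mutex_lock(&lock);
      while (!r->done)                 /* suspended until the worker signals */
          pthread_cond_wait(&cond, &lock);
      pthread_mutex_unlock(&lock);
  }

  /* Send thread: drains the queue while the main thread keeps computing. */
  void *send_thread(void *arg)
  {
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&lock);
          while (queue_head == NULL)
              pthread_cond_wait(&cond, &lock);
          struct send_req *r = queue_head;
          queue_head = r->next;
          pthread_mutex_unlock(&lock);

          /* push_bytes(r->buf, r->len);   stand-in for the real socket transfer */

          pthread_mutex_lock(&lock);
          r->done = 1;
          pthread_cond_broadcast(&cond); /* wake any caller blocked in wait_send */
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }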

Page 18: Multi-threaded communications – Performance

The test was performed by posting a non-blocking send and measuring the number of operations the main thread could perform while waiting for the 'send' to complete.

The system switched to non-blocking TCP sockets when more than one external connection was open.

Page 19: Parallel IO

• Parallel IO gives parallel user applications access to large volumes of data in such a way that, by avoiding sharing, throughput can be increased through optimisations at the OS and H/W architecture levels.
• MPI-2 provides an API for accessing high-performance Parallel IO subsystems (see the example below).
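A small example of the standard MPI-2 parallel I/O API (plain MPI-IO calls, nothing specific to MPI_Connect): every rank writes its own contiguous block of one shared file, so ranks never share file ranges and the I/O subsystem can optimise the accesses. The file name and sizes are illustrative.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, n = 1024;
      double block[1024];
      MPI_File fh;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (int i = 0; i < n; i++)
          block[i] = rank;                  /* dummy data for this rank */

      /* Collective open of one shared file across all ranks. */
      MPI_File_open(MPI_COMM_WORLD, "results.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

      /* Each rank writes at its own non-overlapping offset. */
      MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
      MPI_File_write_at(fh, offset, block, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }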

Page 20: Parallel IO

• Most Parallel IO implementations are built from ROMIO, a model implementation supplied with MPICH 1.1.2.
• The SGI Origin at CEWES runs MPT 1.2.1.0; the version needed is MPT 1.3.
• The Cray T3E (jim) runs MPT 1.2.1.3 (but 1.3 and 1.3.01 (patched) are reported by the system).

Page 21: File handling and Management

• MPI_Connect handles the communication between separate MPI applications, BUT it does not handle the files that they work on or produce.
• The aims of the second half of the project are to give users of MPI_Connect the ability to share files across multiple systems and sites in such a way that this complements their application execution and their use of MPI-2 Parallel IO.

Page 22: File handling and Management

• This project should produce tools that allow applications to share whole files or parts of files, and allow these files to be accessed by running applications no matter where they execute.
  – If an application runs at CEWES or ASC, the input file should be in a single location at the beginning of the run, and at the end of the run the result file should also be in a single location, regardless of where the application executed.

Page 23: File handling and Management

• Two systems considered:
  – Internet Backplane Protocol (IBP)
    • Part of the Internet2 Distributed Storage Infrastructure (I2DSI) project.
    • Code developed at UTK. Tested on the I2 system.
  – SNIPE_Lite store-and-forward daemon (SFD)
    • SNIPE_Lite is already used by MPI_Connect.
    • Code developed at UTK.

Page 24: File handling and management

The system uses five extra commands:

  MPI_Conn_getfile
  MPI_Conn_getfile_view
  MPI_Conn_putfile
  MPI_Conn_putfile_view
  MPI_Conn_releasefile

Getfile gets a file from a central location into the local 'parallel' filesystem.
Putfile puts a file from a local filesystem into a central location.
The _view versions work on subsets of files (assumed prototypes are sketched below).
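For reference only, plausible C prototypes inferred from the usage shown on the next page (mydata, myworkdata, me, num_of_apps, &size). Every signature here is an assumption; the real MPI_Connect header may differ.

  /* Hypothetical prototypes. The "_view" calls take the application's index
   * and the number of co-operating applications, so each application fetches
   * or deposits only its own slice of the central file. */
  int MPI_Conn_getfile      (const char *remote_file, const char *local_file,
                             long *size);
  int MPI_Conn_getfile_view (const char *remote_file, const char *local_file,
                             int app_index, int num_of_apps, long *size);
  int MPI_Conn_putfile      (const char *local_file, const char *remote_file,
                             long *size);
  int MPI_Conn_putfile_view (const char *local_file, const char *remote_file,
                             int app_index, int num_of_apps, long *size);
  int MPI_Conn_releasefile  (const char *remote_file);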

Page 25: File handling and management

• Example code:

  MPI_Init(&argc, &argv);
  …
  MPI_Conn_getfile_view(mydata, myworkdata, me, num_of_apps, &size);
  /* Get my part of the file called mydata and call it myworkdata */
  …
  MPI_File_open(MCW, myworkdata, …);
  /* file is now available via MPI-2 IO */

Page 26: Test-bed

• Between two clusters of Sun UltraSPARC systems.
  – MPI applications are implemented using MPICH.
  – MPI Parallel IO is implemented using ROMIO.
  – The system is tested with both IBP and SNIPE_Lite, as the user API is the same.

Page 27: Example

(Diagram) An input file, two MPI applications (MPI_App 1 and MPI_App 2), and the file support daemon (IBP or SNIPE_Lite SFD).

Page 28: Example

(Diagram) MPI_App 1 issues Getfile(0,2..) and MPI_App 2 issues Getfile(1,2..) to the file support daemon (IBP or SNIPE_Lite SFD) that holds the input file.

Page 29: Example

(Diagram, continued) The file support daemon (IBP or SNIPE_Lite SFD) services the Getfile(0,2..) and Getfile(1,2..) requests from MPI_App 1 and MPI_App 2 for the input file.

Page 30: Example

(Diagram, continued) The daemon passes the input file data to MPI_App 1 and MPI_App 2 in a block-wise fashion so that the daemon is not overloaded.

Page 31: Example

(Diagram, continued) MPI_App 1 and MPI_App 2 each now hold their part of the input file locally; the files are ready to be opened with the MPI-2 MPI_File_open function, as in the sketch below.
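A compact end-to-end sketch of the sequence in this example: fetch this application's slice of the central input file, then read it with standard MPI-2 IO. The MPI_Conn_getfile_view prototype is an assumption (see the earlier sketch); the file names "mydata" and "myworkdata" follow the example code on page 25.

  #include <mpi.h>

  /* Assumed prototype, not the real MPI_Connect header. */
  extern int MPI_Conn_getfile_view(const char *remote_file, const char *local_file,
                                   int app_index, int num_of_apps, long *size);

  int main(int argc, char **argv)
  {
      int me_app = 0, num_of_apps = 2;   /* 0 for the first application, 1 for the second */
      long size = 0;
      double buf[256];
      MPI_File fh;

      MPI_Init(&argc, &argv);

      /* Pull this application's part of the central file "mydata" into the
         local parallel filesystem as "myworkdata". */
      MPI_Conn_getfile_view("mydata", "myworkdata", me_app, num_of_apps, &size);

      /* The fetched file is now an ordinary local file: open and read it
         with standard MPI-2 IO. */
      MPI_File_open(MPI_COMM_WORLD, "myworkdata",
                    MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
      MPI_File_read_at(fh, 0, buf, 256, MPI_DOUBLE, MPI_STATUS_IGNORE);
      MPI_File_close(&fh);

      MPI_Finalize();
      return 0;
  }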

Page 32: Future Work

• File handling (next half)
  – Change the system so that shared files can be combined and new views (access modes) created for the files on the fly.
• File handling (next year)
  – Change the system so that we can perform true prefetching of large data sets before the applications even start to execute. This would be performed by co-operation between the file handling daemons and site-wide batch queue systems such as a modified version of NASA's PBS.

Page 33: DoD Benefits

• MPI_Connect
  – Still allows independent MPI applications to interoperate with little or no loss of local communication performance, allowing the solution of larger problems than is possible with individual systems.
• MPI_Connect communication mode changes
  – These have allowed applications across systems to be more flexible in their use of non-blocking messaging, which increases overall communication performance, without application developers having to code around potential effects of using MPI_Connect.

Page 34: DoD benefits

• File handling and management
  – Allows simplified handling of files/data sets when executing across multiple systems. Allows users to keep central repositories of data without having to collect files from multiple locations when running at non-local sites.
• MPI-2 IO support
  – Together with the file handling utilities, this gives users access to the new parallel-IO subsystems without hindering them. I.e. just because you use MPI_Connect does not mean you should not use parallel IO!

Page 35: Changes

• Moved away from using RCDS to a custom name service.
• Pre-fetching is only possible if the user runs a script that uploads file and partition information into the file handling daemon in advance of queuing the computational jobs.

Page 36: Additional Comments

• External MPI performance
  – Last year's review reported very poor bandwidth between MSRC sites when using MPI_Connect. Recent tests (August 1999) show this is no longer the case. All tests were performed between origin.wes.hpc.mil and other SGI Origins at ARL or ASC.
  – CEWES <-> ASC (hpc03.asc.hpc.mil): 0.925 Mbytes/sec
  – CEWES <-> ARL (adele1.arl.hpc.mil): 0.719 Mbytes/sec
• Internal MPI performance under MPI_Connect
  – CEWES (origin): 77.2 Mbytes/sec
  – ASC (hpc03): 77.4 Mbytes/sec
  – ARL (adele1): 98.0 Mbytes/sec (faster, newer machine)

Page 37: Additional Comments

• Other than its use as part of the DoD MSRC CEWES HPC challenge at SC98, MPI_Connect was recently used for a Challenge project at a DOE ASCI site. The computation involved over 5800 CPUs on a 40000 x 40000 system of linear equations, accounting for almost 35,000 CPU hours in a single run, demonstrating its stability and low performance overheads compared with other competing meta-computing middleware.

Page 38: Additional Comments

• We need help from on-site leads with locating users who actually use Parallel IO from within MPI applications.
  – Many users agree that Parallel IO is important, but few actually use it explicitly in their applications.

Page 39: Conclusions

• MPI_Connect is still on target to meet its milestones.
• MPI_Connect has improved the communication between external systems and no longer needs users to run PVM.
• The file management tools and new calls allow very simple handling of files across systems in a natural way, accessible directly from within applications. This support also complements Parallel IO rather than hindering its adoption and use.