
Parallel I/O with MPI

[email protected]

CNRS - IDRIS

PATC Training session: Parallel filesystems and parallel I/O libraries

Maison de la Simulation / March 5th and 6th 2015


This document is extracted from the MPI course material written by I. Dupays, M. Flé, J. Gaidamour, D. Girou, P.-F. Lavallée, D. Lecas and P. Wautelet.

https://cours.idris.fr


Plan

1 Introduction

2 Blocking Data Access
   File Manipulation
   Data access: Concepts
   Noncollective data access
      Data access with explicit offsets
      Data access with individual file pointers
   Collective data access
      Data access with explicit offsets
      Data access with individual file pointers
      Data access with shared file pointers
   Positioning the file pointers

3 Derived Datatypes
   Introduction
   Subarray Datatype Constructor
   Commit and Free

4 File Views
   Definition
   Reading non-overlapping sequences of data segments in parallel

5 Conclusion


Introduction


Input/Output Optimisation

Applications which perform large calculations also tend to handle large amounts of data and generate a significant number of I/O requests.

Effective treatment of I/O can greatly improve the overall performance of applications. I/O tuning of parallel codes involves:

• Parallelizing the I/O accesses of the program, in order to avoid serial bottlenecks and to take advantage of parallel file systems
• Implementing efficient data access algorithms
• Leveraging mechanisms implemented by the operating system (request grouping methods, I/O buffers...)

Libraries make I/O optimisation of parallel codes easier by providing ready-to-use capabilities.


The MPI-IO interface

The MPI-2 standard defines a set of functions designed to manage parallel I/O.

The I/O functions use well-known MPI concepts. For instance, collective and non-blocking operations on files are similar to their counterparts between MPI processes. Files can also be accessed in a patterned way using the existing derived datatype functionality.

Other concepts come from native I/O interfaces (file descriptors, attributes . . . ).


Challenges

The goal of MPI-IO, via the high-level interface that it proposes, is to provide simplicity, expressiveness and flexibility to the user, while authorising optimised implementations which take into account the software and hardware specificities of the I/O devices on the target machines.

It is the user's responsibility to leverage some of the optimisations enabled by the library (e.g. collective or non-blocking I/O).

But many other essential optimisations implemented in the libraries improve performance automatically.


Blocking Data Access


Opening and closing a file

Working with files

Opening and closing files are collective operations within the scope of a communicator.

Opening a file generates a file handle, an opaque representation of the opened file. File handles can subsequently be used to access files in MPI I/O subroutines.

Access modes describe the opening mode, access rights, etc. Modes are specified at the opening of a file, using predefined MPI constants that can be combined together.

All the processes of the communicator participate in subsequent collective operations.

We only describe here the open/close subroutines, but other file management operations are available (preallocation, deletion, etc.). For instance, MPI_FILE_GET_INFO() returns details on a file handle (the information varies with implementations).


program open01
  use mpi
  implicit none

  integer :: fh,code

  call MPI_INIT(code)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"file.data", &
       MPI_MODE_RDWR + MPI_MODE_CREATE,MPI_INFO_NULL,fh,code)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)

end program open01

> ls -l file.data

-rw------- 1 user grp 0 Feb 08 12:13 file.data
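As noted above, MPI_FILE_GET_INFO() returns the hints attached to a file handle. A minimal sketch of listing them (an illustration, not taken from the course; the available keys and values are implementation dependent, and the 255-character value buffer is an arbitrary choice):

  integer :: info_used, nkeys, i
  character(len=MPI_MAX_INFO_KEY) :: key
  character(len=255) :: value
  logical :: flag

  ! Retrieve a copy of the hints in effect for this file handle
  call MPI_FILE_GET_INFO(fh,info_used,code)
  call MPI_INFO_GET_NKEYS(info_used,nkeys,code)
  do i=0,nkeys-1
     call MPI_INFO_GET_NTHKEY(info_used,i,key,code)
     call MPI_INFO_GET(info_used,key,len(value),value,flag,code)
     print *, trim(key)," = ",trim(value)
  end do
  ! The returned info object is a copy and must be freed
  call MPI_INFO_FREE(info_used,code)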


Table: Access modes which can be defined at the opening of files

Mode                       Meaning
MPI_MODE_RDONLY            Read only
MPI_MODE_RDWR              Reading and writing
MPI_MODE_WRONLY            Write only
MPI_MODE_CREATE            Create the file if it does not exist
MPI_MODE_EXCL              Error if creating a file that already exists
MPI_MODE_UNIQUE_OPEN       File will not be concurrently opened elsewhere
MPI_MODE_SEQUENTIAL        File will only be accessed sequentially
MPI_MODE_APPEND            Set initial position of all file pointers to end of file
MPI_MODE_DELETE_ON_CLOSE   Delete file on close
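Modes are combined by summing the constants, as in the open01 example above. For instance, a scratch file that disappears automatically when it is closed could be opened as follows (a sketch; the file name is illustrative):

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"scratch.data", &
       MPI_MODE_WRONLY + MPI_MODE_CREATE + MPI_MODE_DELETE_ON_CLOSE, &
       MPI_INFO_NULL,fh,code)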


Read/Write: Concepts

Data access routines

MPI-IO proposes a broad range of subroutines for transferring data between files and memory. The subroutines can be distinguished through several properties:

• The position in the file can be specified using an explicit offset (i.e. an absolute position relative to the beginning of the file) or using individual or shared file pointers (i.e. the offset is defined by the current value of the pointers).
• Data access can be blocking or non-blocking.
• Operations can be collective (within the communicator group) or noncollective.

Different access methods may be mixed within the same program.


Table: Summary of the data access subroutines

Positioning     Synchronism   Noncollective            Collective
explicit        blocking      MPI_FILE_READ_AT         MPI_FILE_READ_AT_ALL
offsets                       MPI_FILE_WRITE_AT        MPI_FILE_WRITE_AT_ALL
                nonblocking   MPI_FILE_IREAD_AT        MPI_FILE_READ_AT_ALL_BEGIN / _END
                              MPI_FILE_IWRITE_AT       MPI_FILE_WRITE_AT_ALL_BEGIN / _END
individual      blocking      MPI_FILE_READ            MPI_FILE_READ_ALL
file pointers                 MPI_FILE_WRITE           MPI_FILE_WRITE_ALL
                nonblocking   MPI_FILE_IREAD           MPI_FILE_READ_ALL_BEGIN / _END
                              MPI_FILE_IWRITE          MPI_FILE_WRITE_ALL_BEGIN / _END
shared          blocking      MPI_FILE_READ_SHARED     MPI_FILE_READ_ORDERED
file pointers                 MPI_FILE_WRITE_SHARED    MPI_FILE_WRITE_ORDERED
                nonblocking   MPI_FILE_IREAD_SHARED    MPI_FILE_READ_ORDERED_BEGIN / _END
                              MPI_FILE_IWRITE_SHARED   MPI_FILE_WRITE_ORDERED_BEGIN / _END
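The nonblocking variants are not demonstrated elsewhere in this course. A minimal sketch of a nonblocking read at an explicit offset (declarations as in the examples below; the overlapped computation is illustrative): the subroutine returns a request which must be completed with MPI_WAIT() before the buffer is used, exactly as for nonblocking communications.

  integer :: request

  ! Start the read; the transfer may progress in the background
  call MPI_FILE_IREAD_AT(fh,offset,values,nb_values,MPI_INTEGER,request,code)
  ! ... computation that does not touch values ...
  ! Complete the operation before using the buffer
  call MPI_WAIT(request,status,code)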


File Views

By default, files are treated as a sequence of bytes, but access patterns can also be expressed using predefined or derived MPI datatypes.

This mechanism is called file views and is described in further detail later.

For now, we only need to know that the views rely on an elementary data type and that the default type is MPI_BYTE.


Data access with explicit offsets

Explicit Offsets

Explicit offset operations perform data access directly at the file position given as an argument.

The offset is expressed as a multiple of the elementary data type of the current view (therefore, the default offset unit is bytes).

The datatype and the number of elements in the memory buffer are specified as arguments (e.g. MPI_INTEGER).


program write_at
  use mpi
  implicit none

  integer, parameter :: nb_values=10
  integer :: i,rank,fh,code,bytes_in_integer
  integer(kind=MPI_OFFSET_KIND) :: offset
  integer, dimension(nb_values) :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  values(:) = (/(i+rank*100,i=1,nb_values)/)
  print *, "Write process",rank,":",values(:)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_WRONLY + MPI_MODE_CREATE, &
       MPI_INFO_NULL,fh,code)

  call MPI_TYPE_SIZE(MPI_INTEGER,bytes_in_integer,code)

  offset=rank*nb_values*bytes_in_integer
  call MPI_FILE_WRITE_AT(fh,offset,values,nb_values,MPI_INTEGER, &
       status,code)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)
end program write_at


Figure: MPI_FILE_WRITE_AT(). Process 0 holds the values 1 to 10 and writes them at the beginning of the file; process 1 holds the values 101 to 110 and writes them at the offset just after, so the file contains 1 ... 10 followed by 101 ... 110.

> mpiexec -n 2 write_at
Write process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Write process 1 : 101, 102, 103, 104, 105, 106, 107, 108, 109, 110



program read_at
  use mpi
  implicit none

  integer, parameter :: nb_values=10
  integer :: rank,fh,code,bytes_in_integer
  integer(kind=MPI_OFFSET_KIND) :: offset
  integer, dimension(nb_values) :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
       fh,code)

  call MPI_TYPE_SIZE(MPI_INTEGER,bytes_in_integer,code)

  offset=rank*nb_values*bytes_in_integer
  call MPI_FILE_READ_AT(fh,offset,values,nb_values,MPI_INTEGER, &
       status,code)
  print *, "Read process",rank,":",values(:)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)

end program read_at


Figure: MPI_FILE_READ_AT(). Process 0 reads the first 10 integers of the file (1 to 10); process 1 reads the following 10 integers (101 to 110).

> mpiexec -n 2 read_at
Read process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Read process 1 : 101, 102, 103, 104, 105, 106, 107, 108, 109, 110


Noncollective data access

Individual file pointers

MPI maintains one individual file pointer per process per file handle.

The current value of this pointer implicitly specifies the offset in the data access routines.

After an individual file pointer operation is initiated, the individual file pointer is updated to point to the next data item.

The shared file pointer is neither used nor updated.


program read02
  use mpi
  implicit none

  integer, parameter :: nb_values=10
  integer :: rank,fh,code
  integer, dimension(nb_values) :: values=0
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
       fh,code)

  if (rank == 0) then
     call MPI_FILE_READ(fh,values,5,MPI_INTEGER,status,code)
  else
     call MPI_FILE_READ(fh,values,8,MPI_INTEGER,status,code)
     call MPI_FILE_READ(fh,values,5,MPI_INTEGER,status,code)
  end if

  print *, "Read process",rank,":",values(1:8)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)
end program read02


Figure: Example 2 of MPI_FILE_READ(). The file contains 1 ... 10 followed by 101 ... 110, and each process advances its own individual file pointer. Process 0 reads 5 integers into values(1:5) and gets 1 to 5. Process 1 first reads 8 integers (1 to 8) into values(1:8), then 5 more (9, 10, 101, 102, 103), which overwrite values(1:5).

> mpiexec -n 2 read02
Read process 0 : 1, 2, 3, 4, 5, 0, 0, 0
Read process 1 : 9, 10, 101, 102, 103, 6, 7, 8


Collective data access


Collective operations require the participation of all the processes within the communicator group associated with the file handle.

Collective operations may perform much better than their noncollective counterparts, as global data accesses have significant potential for automatic optimisation.

For the collective shared file pointer routines, the accesses to the file will be in the order determined by the ranks of the processes within the group. The ordering is therefore deterministic.


program read_at_all
  use mpi
  implicit none

  integer, parameter :: nb_values=10
  integer :: rank,fh,code,bytes_in_integer
  integer(kind=MPI_OFFSET_KIND) :: offset_file
  integer, dimension(nb_values) :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
       fh,code)

  call MPI_TYPE_SIZE(MPI_INTEGER,bytes_in_integer,code)
  offset_file=rank*nb_values*bytes_in_integer
  call MPI_FILE_READ_AT_ALL(fh,offset_file,values,nb_values, &
       MPI_INTEGER,status,code)
  print *, "Read process",rank,":",values(:)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)
end program read_at_all


Figure: Example of MPI_FILE_READ_AT_ALL(). All processes call the routine together: process 0 reads the first 10 integers (1 to 10), process 1 the following 10 (101 to 110).

> mpiexec -n 2 read_at_all
Read process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Read process 1 : 101, 102, 103, 104, 105, 106, 107, 108, 109, 110


program read_all01
  use mpi
  implicit none

  integer, parameter :: nb_values=10
  integer :: read_sz,rank,fh,code
  integer, dimension(nb_values) :: values=0
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
       fh,code)

  read_sz = 4+rank
  call MPI_FILE_READ_ALL(fh,values,read_sz,MPI_INTEGER,status,code)
  call MPI_FILE_READ_ALL(fh,values(1+read_sz),read_sz,MPI_INTEGER,status,code)

  print *, "Read process",rank,":",values(:)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)
end program read_all01


Figure: Example 1 of MPI_FILE_READ_ALL(). Both processes read collectively from the beginning of the file, each with its own individual file pointer, and the amounts read may differ between processes. Process 0 reads 4 integers twice (1 to 4, then 5 to 8); process 1 reads 5 integers twice (1 to 5, then 6 to 10).

> mpiexec -n 2 read_all01
Read process 1 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Read process 0 : 1, 2, 3, 4, 5, 6, 7, 8, 0, 0



Shared file pointer

MPI maintains only one shared file pointer per collective MPI_FILE_OPEN() (shared among the processes in the communicator group).

All processes must use the same file view.

For the noncollective shared file pointer routines, the serialisation ordering is not deterministic. To enforce a specific order, the user needs to use other synchronisation means or use the collective variants.

After a shared file pointer operation, the shared file pointer is updated to point to the next data item, that is, just after the last one accessed by the operation.

The individual file pointers are neither used nor updated.


program read_ordered
  use mpi
  implicit none

  integer :: rank,fh,code
  integer, parameter :: nb_values=10
  integer, dimension(nb_values) :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
       fh,code)

  call MPI_FILE_READ_ORDERED(fh,values,4,MPI_INTEGER,status,code)
  call MPI_FILE_READ_ORDERED(fh,values(5),6,MPI_INTEGER,status,code)

  print *, "Read process",rank,":",values(:)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)
end program read_ordered


Figure: Example of MPI_FILE_READ_ORDERED(). The shared file pointer advances in rank order at each collective call. First call (4 integers each): process 0 gets 1 to 4, process 1 gets 5 to 8. Second call (6 integers each, into values(5:10)): process 0 gets 9, 10, 101, 102, 103, 104 and process 1 gets 105 to 110.

> mpiexec -n 2 read_ordered
Read process 1 : 5, 6, 7, 8, 105, 106, 107, 108, 109, 110
Read process 0 : 1, 2, 3, 4, 9, 10, 101, 102, 103, 104

Philippe WAUTELET (CNRS/IDRIS) MPI-IO March 5th 2015 31 / 55

Process 0

File 1 2 3 4 5 6 7 8 9 10 101102103104105106107108109110

Process 1

1 2 3 41 2 3 4

z }| {

| {z }

5 6 7 85 6 7 8

| {z }

z }| {

1 2 3 41 2 3 4 9 10 1011021031049 10 101102103104

z }| {

| {z }

5 6 7 85 6 7 8 105106107108109110105106107108109110

| {z }

z }| {

Figure: Example of MPI_FILE_ORDERED()

> mpiexec -n 2 read_ordered

Read process 1 : 5, 6, 7, 8, 105, 106, 107, 108, 109, 110Read process 0 : 1, 2, 3, 4, 9, 10, 101, 102, 103, 104

Philippe WAUTELET (CNRS/IDRIS) MPI-IO March 5th 2015 31 / 55

Process 0

File 1 2 3 4 5 6 7 8 9 10 101102103104105106107108109110

Process 1

1 2 3 41 2 3 4

z }| {

| {z }

5 6 7 85 6 7 8

| {z }

z }| {

1 2 3 41 2 3 4 9 10 1011021031049 10 101102103104

z }| {

| {z }

5 6 7 85 6 7 8 105106107108109110105106107108109110

| {z }

z }| {

Figure: Example of MPI_FILE_ORDERED()

> mpiexec -n 2 read_ordered

Read process 1 : 5, 6, 7, 8, 105, 106, 107, 108, 109, 110Read process 0 : 1, 2, 3, 4, 9, 10, 101, 102, 103, 104

Philippe WAUTELET (CNRS/IDRIS) MPI-IO March 5th 2015 31 / 55

Process 0

File 1 2 3 4 5 6 7 8 9 10 101102103104105106107108109110

Process 1

1 2 3 41 2 3 4

z }| {

| {z }

5 6 7 85 6 7 8

| {z }

z }| {

1 2 3 41 2 3 4 9 10 1011021031049 10 101102103104

z }| {

| {z }

5 6 7 85 6 7 8 105106107108109110105106107108109110

| {z }

z }| {

Figure: Example of MPI_FILE_ORDERED()

> mpiexec -n 2 read_ordered

Read process 1 : 5, 6, 7, 8, 105, 106, 107, 108, 109, 110Read process 0 : 1, 2, 3, 4, 9, 10, 101, 102, 103, 104

Philippe WAUTELET (CNRS/IDRIS) MPI-IO March 5th 2015 31 / 55

Process 0

File 1 2 3 4 5 6 7 8 9 10 101102103104105106107108109110

Process 1

1 2 3 41 2 3 4

z }| {

| {z }

5 6 7 85 6 7 8

| {z }

z }| {

1 2 3 41 2 3 4 9 10 1011021031049 10 101102103104

z }| {

| {z }

5 6 7 85 6 7 8 105106107108109110105106107108109110

| {z }

z }| {

Figure: Example of MPI_FILE_ORDERED()

> mpiexec -n 2 read_ordered

Read process 1 : 5, 6, 7, 8, 105, 106, 107, 108, 109, 110Read process 0 : 1, 2, 3, 4, 9, 10, 101, 102, 103, 104

Philippe WAUTELET (CNRS/IDRIS) MPI-IO March 5th 2015 31 / 55

Positioning the file pointers


MPI_FILE_GET_POSITION() and MPI_FILE_GET_POSITION_SHARED() return the current position of the individual pointer and of the shared file pointer (respectively).

MPI_FILE_SEEK() and MPI_FILE_SEEK_SHARED() update the file pointer values by using the following possible modes:

• MPI_SEEK_SET: The pointer is set to offset.
• MPI_SEEK_CUR: The pointer is set to the current pointer position plus offset.
• MPI_SEEK_END: The pointer is set to the end of file plus offset.

The offset can be negative, which allows seeking backwards.
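A minimal sketch combining these subroutines (not an example from the course; fh is an open file handle): query the individual pointer, then move it two etypes backwards.

  integer(kind=MPI_OFFSET_KIND) :: position, offset

  ! Current position of the individual pointer, in etype units of the view
  call MPI_FILE_GET_POSITION(fh,position,code)

  ! Step back two etypes from the current position
  offset=-2
  call MPI_FILE_SEEK(fh,offset,MPI_SEEK_CUR,code)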


program seek
  use mpi
  implicit none

  integer, parameter :: nb_values=10
  integer :: rank,fh,bytes_in_integer,code
  integer(kind=MPI_OFFSET_KIND) :: offset
  integer, dimension(nb_values) :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)
  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
       fh,code)

  call MPI_FILE_READ(fh,values,3,MPI_INTEGER,status,code)
  call MPI_TYPE_SIZE(MPI_INTEGER,bytes_in_integer,code)
  offset=8*bytes_in_integer
  call MPI_FILE_SEEK(fh,offset,MPI_SEEK_CUR,code)
  call MPI_FILE_READ(fh,values(4),3,MPI_INTEGER,status,code)
  offset=4*bytes_in_integer
  call MPI_FILE_SEEK(fh,offset,MPI_SEEK_SET,code)
  call MPI_FILE_READ(fh,values(7),4,MPI_INTEGER,status,code)

  print *, "Read process",rank,":",values(:)

  call MPI_FILE_CLOSE(fh,code)
  call MPI_FINALIZE(code)
end program seek


Figure: Example of MPI_FILE_SEEK(). Each process reads 3 integers (1, 2, 3), seeks 8 integers forward from the current position and reads 3 more (102, 103, 104), then seeks to the absolute position of the 5th integer and reads 4 (5, 6, 7, 8).

> mpiexec -n 2 seek
Read process 1 : 1, 2, 3, 102, 103, 104, 5, 6, 7, 8
Read process 0 : 1, 2, 3, 102, 103, 104, 5, 6, 7, 8


Derived Datatypes


Introduction

MPI communications consist of sending and receiving MPI data elements, such as MPI_INTEGER, MPI_REAL or MPI_COMPLEX.

More general data structures can be defined using subroutines such as MPI_TYPE_VECTOR(), MPI_TYPE_INDEXED(), MPI_TYPE_CREATE_SUBARRAY() or MPI_TYPE_CREATE_STRUCT().

Derived datatypes provide a mechanism for working with noncontiguous or heterogeneous data. Using derived datatypes limits the number of MPI calls required for transferring complex data patterns.
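For instance, MPI_TYPE_VECTOR() describes equally spaced blocks of elements. A minimal sketch (an illustration, not taken from the course): with Fortran's column-major storage, one row of a 4x3 integer array consists of 3 blocks of 1 element separated by a stride of 4 elements.

  integer :: type_row, code

  ! One row of a 4x3 integer array: 3 blocks of 1 integer, stride 4
  call MPI_TYPE_VECTOR(3,1,4,MPI_INTEGER,type_row,code)
  call MPI_TYPE_COMMIT(type_row,code)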


Subarray Datatype Constructor

MPI_TYPE_CREATE_SUBARRAY() creates a datatype describing a subarray of a full array.

Fortran 95 Arrays Vocabulary Reminder

The rank of an array is defined as the number of dimensions in the array.

The extent of an array in a given dimension is the number of index entries available in that direction.

The shape of an array is a list of the number of elements in each dimension (i.e. the list of the extents).

Example: The rank of T(10,0:5,-10:10) is 3; its extent is 10 in the first dimension, 6 in the second and 21 in the third; its shape is (10,6,21).


integer, intent(in) :: ndims
integer, dimension(ndims), intent(in) :: array_of_sizes, array_of_subsizes
integer, dimension(ndims), intent(in) :: array_of_starts
integer, intent(in) :: order, oldtype
integer, intent(out) :: newtype, code

call MPI_TYPE_CREATE_SUBARRAY(ndims,array_of_sizes,array_of_subsizes, &
     array_of_starts,order,oldtype,newtype,code)

Argument Description

ndims: Number of array dimensions (array rank).

array_of_sizes: Number of elements in each dimension of the full array (array shape).

array_of_subsizes: Number of elements in each dimension of the subarray (subarray shape).

array_of_starts: Starting coordinates of the subarray in each dimension. Arrays are assumed to be indexed starting from zero. In Fortran, to start a subarray at array(2,3), use array_of_starts=(/ 1,2 /).

order: Array storage order flag:
• MPI_ORDER_FORTRAN: Column-major order
• MPI_ORDER_C: Row-major order


Commit and Free

A derived datatype object has to be committed before it can be used in a communication, using the MPI_TYPE_COMMIT() subroutine.

integer, intent(inout) :: newtype
integer, intent(out) :: code

call MPI_TYPE_COMMIT(newtype,code)

MPI_TYPE_FREE() frees the datatype. Datatype names can be reused afterwards.

integer, intent(inout) :: newtype
integer, intent(out) :: code

call MPI_TYPE_FREE(newtype,code)


Figure: Exchanges between two processes. (Each process sends a 2x2 subarray of its 4x3 array to the other, as implemented in the program below.)


program subarray
  use mpi
  implicit none

  integer, parameter :: nb_lines=4, nb_columns=3, &
                        tag=1000, nb_dims=2
  integer :: code, rank, type_subarray, i
  integer, dimension(nb_lines,nb_columns) :: tab
  integer, dimension(nb_dims) :: shape_array, shape_subarray, coord_start
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! Initialisation of the array on each process
  tab(:,:) = reshape( (/ (sign(i,-rank),i=1,nb_lines*nb_columns) /) , &
                      (/ nb_lines,nb_columns /) )

  ! Shape of the full array
  shape_array(:) = shape(tab)

  ! The F95 shape intrinsic returns the shape of an array.
  ! Warning: If the array is not allocated on every process, the array shape must be
  ! determined manually: shape_array(:) = (/ nb_lines,nb_columns /)

  ! Shape of the subarray
  shape_subarray(:) = (/ 2,2 /)

  ! Starting coordinates of the subarray
  ! On process 0 we start at tab(2,1)
  ! On process 1 we start at tab(3,2)
  coord_start(:) = (/ rank+1,rank /)

  ! Creation of the subarray datatype
  call MPI_TYPE_CREATE_SUBARRAY(nb_dims,shape_array,shape_subarray,coord_start, &
       MPI_ORDER_FORTRAN,MPI_INTEGER,type_subarray,code)
  call MPI_TYPE_COMMIT(type_subarray,code)

  ! Subarray exchange
  call MPI_SENDRECV_REPLACE(tab,1,type_subarray,mod(rank+1,2),tag, &
       mod(rank+1,2),tag,MPI_COMM_WORLD,status,code)

  call MPI_TYPE_FREE(type_subarray,code)
  call MPI_FINALIZE(code)
end program subarray


File Views


The View Mechanism

File views are a mechanism for accessing data in a high-level way. A view describes a template for accessing a file.

The view that a given process has of an open file is defined by three components: the elementary data type (etype), the filetype and an initial displacement.

The view is determined by the repetition of the filetype pattern, beginning at the displacement.

Figure: Tiling a file with a filetype. Starting at the initial displacement, the filetype (a pattern of etypes and holes) is repeated along the file; the etypes form the accessible data, the holes are skipped.



File views are defined using MPI datatypes.

Derived datatypes can be used to structure accesses to the file. For example, elements can be skipped during data access.

The default view is a linear byte stream (the displacement is zero, the etype and filetype are equal to MPI_BYTE).

Multiple Views

Each process can successively use several views on the same file, as in the sketch below.

Each process can define its own view of the file and access complementary parts of it.
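A minimal sketch of successive views on one file (an assumption for illustration, not an example from the course; MPI_FILE_SET_VIEW() is detailed below): a process writes a one-integer header through the default byte-stream view, then installs a datatype-based view whose displacement skips the header.

  integer :: fh, header, filetype, code
  integer(kind=MPI_OFFSET_KIND) :: disp
  integer, dimension(MPI_STATUS_SIZE) :: status

  ! Default view (linear byte stream): write a one-integer header at the start
  header=42
  call MPI_FILE_WRITE(fh,header,1,MPI_INTEGER,status,code)

  ! New view: the accessible data starts after the header (4 bytes here,
  ! assuming 4-byte integers) and is tiled with a committed filetype
  disp=4
  call MPI_FILE_SET_VIEW(fh,disp,MPI_INTEGER,filetype, &
       "native",MPI_INFO_NULL,code)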


Figure: Separate views, each using a different filetype, can be used to access the file. Three processes share the same etype and initial displacement but define complementary filetypes (proc. 0, proc. 1, proc. 2); tiled side by side, the filetypes partition the file among the processes.

Limitations:

Shared file pointer routines are not useable except when all the processes have the same file view.

If the file is opened for writing, neither the etype nor the filetype is permitted to contain overlapping regions.



Changing the process’s view of the data in the file: MPI_FILE_SET_VIEW()

integer :: fh
integer(kind=MPI_OFFSET_KIND) :: displacement
integer :: etype
integer :: filetype
character(len=*) :: mode
integer :: info
integer :: code

call MPI_FILE_SET_VIEW(fh,displacement,etype,filetype,mode,info,code)

This operation is collective over the group of processes associated with the file handle. The values of the initial displacement and of the filetype may vary between the processes in the group, but the extents of the elementary types must be identical.

In addition, the individual file pointers and the shared file pointer are reset to zero.

Notes: The datatypes passed in must have been committed using the MPI_TYPE_COMMIT() subroutine.

MPI defines three data representations (mode): "native", "internal" or "external32".


Example 1: Reading non-overlapping sequences of data segments in parallel

Figure: The file contains 1 ... 10 followed by 101 ... 110. The views select alternating pairs of integers: process 0 accesses the pairs (1, 2), (5, 6), (9, 10), (103, 104), (107, 108) and process 1 the pairs (3, 4), (7, 8), (101, 102), (105, 106), (109, 110).

> mpiexec -n 2 read_view01
Read process 1 : 3, 4, 7, 8, 101, 102, 105, 106, 109, 110
Read process 0 : 1, 2, 5, 6, 9, 10, 103, 104, 107, 108



Example 1 (continued)

The view is built from an initial displacement of 0, the etype MPI_INTEGER and one filetype per process: a one-dimensional pattern of 4 etypes containing a 2-etype subarray, starting at element 1 for process 0 and at element 3 for process 1.

if (rank == 0) coord=1
if (rank == 1) coord=3

call MPI_TYPE_CREATE_SUBARRAY(1,(/4/),(/2/),(/coord - 1/), &
     MPI_ORDER_FORTRAN,MPI_INTEGER,filetype,code)
call MPI_TYPE_COMMIT(filetype,code)

! Using an intermediate variable for portability reasons
init_displacement=0

call MPI_FILE_SET_VIEW(handle,init_displacement,MPI_INTEGER,filetype, &
     "native",MPI_INFO_NULL,code)


program read_view01
  use mpi
  implicit none

  integer, parameter :: nb_values=10
  integer :: rank,handle,coord,filetype,code
  integer(kind=MPI_OFFSET_KIND) :: init_displacement
  integer, dimension(nb_values) :: values
  integer, dimension(MPI_STATUS_SIZE) :: status

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  if (rank == 0) coord=1
  if (rank == 1) coord=3

  call MPI_TYPE_CREATE_SUBARRAY(1,(/4/),(/2/),(/coord - 1/), &
       MPI_ORDER_FORTRAN,MPI_INTEGER,filetype,code)
  call MPI_TYPE_COMMIT(filetype,code)

  call MPI_FILE_OPEN(MPI_COMM_WORLD,"data.dat",MPI_MODE_RDONLY,MPI_INFO_NULL, &
       handle,code)

  init_displacement=0
  call MPI_FILE_SET_VIEW(handle,init_displacement,MPI_INTEGER,filetype, &
       "native",MPI_INFO_NULL,code)
  call MPI_FILE_READ_ALL(handle,values,nb_values,MPI_INTEGER,status,code)

  print *, "Read process",rank,":",values(:)

  call MPI_FILE_CLOSE(handle,code)
  call MPI_FINALIZE(code)

end program read_view01


Conclusion


MPI-IO provides a high-level I/O interface and a rich set of functionalities. Complex operations can be performed easily using an MPI-like interface, and MPI libraries provide suitable optimisations. MPI-IO also achieves some portability.

Advice

Avoid subroutines with explicit positioning and prefer the use of shared or individual pointers, as they provide a higher-level interface.

Take advantage of collective I/O operations, as they are generally more efficient.

Use asynchronous I/O only after getting correct behaviour with a blocking version.


Annexes


Definitions (files)

file: An MPI file is an ordered collection of typed data items. A file is opened collectively by a group of processes. All collective I/O calls on a file are collective over this group.

file handle: A file handle is an opaque object created by MPI_FILE_OPEN() and freed by MPI_FILE_CLOSE(). All operations on an open file reference the file through the file handle.

file pointer: A file pointer is an implicit offset maintained by MPI.

offset: An offset is a position in the file relative to the current view, expressed as a count of etypes. Holes in the view's filetype are skipped when calculating this position.


Definitions (views)

displacement: A file displacement is an absolute byte position relative to the beginning of a file. The displacement defines the location where a view begins.

etype: An etype (elementary datatype) is the unit of data access and positioning. It can be any MPI predefined or derived datatype. Data access is performed in etype units, reading or writing whole data items of type etype. Offsets are expressed as a count of etypes.

filetype: A filetype is the basis for partitioning a file among processes and defines a template for accessing the file. A filetype is either a single etype or a derived MPI datatype constructed from multiple instances of the same etype. In addition, the extent of any hole in the filetype must be a multiple of the etype's extent.

view: A view defines the current set of data visible and accessible from an open file as an ordered set of etypes.
