Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
© 2009 IBM Corporation
pNFS, POSIX, and MPI-IO: A Tale of Three Semantics
Dean Hildebrand, Roger Haskin ― IBM AlmadenArifa Nisar ― Northwestern University
Dean Hildebrand – Research Staff Member
PDSW 2009
© 2009 IBM Corporation2
PDSW’09
Agenda
Motivation
pNFS
HPC consistency requirements
Protocol consistency semantics
NFSv3 ADIO driver
pNFS ADIO driver
© 2009 IBM Corporation3
PDSW’09
Motivation: Commodity parallel file system clients
� Supercomputers can be connected with multiple parallel file systems– GPFS, PVFS2, PanFS, Lustre
� Want single file system client to access all available storage systems
Storage Systems
StorageSupercomputersI/O Nodes
File System Clients
© 2009 IBM Corporation4
PDSW’09
Agenda
Motivation
pNFS
HPC consistency requirements
Protocol consistency semantics
NFSv3 ADIO driver
pNFS ADIO driver
© 2009 IBM Corporation5
PDSW’09
pNFS
pNFS Client
Layout Driver
I/O & Policy API
pNFS Server
Parallel File System
VFS API
Client
Data Servers
State Server
Parallel I/O
Metadata
Management Protocol
NFSv4.1/pNFS now has:
� Integrated locking protocol
� Open/Close with per-object change attribute
© 2009 IBM Corporation6
PDSW’09
pNFS with GPFS
StorageGPFS NSD Servers Or SAN RAID controllers
File-based NFSv4.1 Clients GPFS Data and State
Servers
� Fully-symmetric GPFS architecture - scalable data and metadata– pNFS client can mount and retrieve layout from any GPFS node– metadata requests can be load balanced across cluster
� pNFS server and native GPFS clients can share the same file system– Backup, dedup, and other mgmt functions don’t need to be done over NFS
� Need robust interface between NFSD and GPFS
AIX
Solaris
Linux
© 2009 IBM Corporation7
PDSW’09
Agenda
Motivation
pNFS
HPC consistency requirements
Protocol consistency semantics
NFSv3 ADIO driver
pNFS ADIO driver
© 2009 IBM Corporation8
PDSW’09
HPC consistency requirements
� Different I/O workloads have different requirements
– Checkpoint• Write-only• No revalidation to new file
– Ingest/Restart• Read-only• Revalidation on Open
– sync-barrier-sync• Sync data to disk• Force revalidate/invalidate
– Others?
© 2009 IBM Corporation9
PDSW’09
MPI-IO: sync-barrier-sync
� Applications sometimes need to share computed results among processes/nodes
� Want to avoid MPI atomic mode to improve performance– Requires enforcing strict consistency semantics– Writes must be immediately visible by other processes
� Allows compute nodes to synchronize I/O operations between themselves– Sync #1 guarantees that the data written by all nodes is transferred to storage.– Barrier ensures that writes on all nodes complete prior to reads.– Sync #2 guarantees that all transferred data is visible to all processes.
© 2009 IBM Corporation10
PDSW’09
Sync-Barrier-Sync
Compute Nodes Storage SystemNetwork
1
2
3
Each node has data in its cache
© 2009 IBM Corporation11
PDSW’09
Sync-Barrier-Sync
1. SYNC: Place written data on filing servers
Compute Nodes Storage SystemNetwork
1
2
3
© 2009 IBM Corporation12
PDSW’09
Sync-Barrier-Sync
2. BARRIER: Clients wait for other clients to flush dirty data to the servers, ensuring that no client issues read requests until all clients have the same view of file contents.
Compute Nodes Storage SystemNetwork
1
2
3
© 2009 IBM Corporation13
PDSW’09
Sync-Barrier-Sync
3. SYNC: Perform file revalidation by ensuring written data is visible to all nodes.
Compute Nodes Storage SystemNetwork
1
2
3
© 2009 IBM Corporation14
PDSW’09
Sync-Barrier-Sync
Nodes read new data
Compute Nodes Storage SystemNetwork
1
2
3
© 2009 IBM Corporation15
PDSW’09
Agenda
Motivation
pNFS
HPC consistency requirements
Protocol consistency semantics
NFSv3 ADIO driver
pNFS ADIO driver
© 2009 IBM Corporation16
PDSW’09
Protocol consistency semantics
CloseFnctl ulockFsync
OpenFnctl lock“____“
pNFSClose-to-open
MPI_File_SyncMPI_File_Close
MPI_File_SyncMPI_File_Open
MPI-IO (non-atomic)
Fsync
Close
Open
Read
POSIXLast writer wins
Data FlushRevalidation
Notes:
� POSIX does not require revalidation primitive
� NFS lacks primitive to leverage per-object change attribute
© 2009 IBM Corporation17
PDSW’09
Agenda
Motivation
pNFS
HPC consistency requirements
Protocol consistency semantics
NFSv3 ADIO driver
pNFS ADIO driver
© 2009 IBM Corporation18
PDSW’09
NFSv3 ADIO driver
ROMIO/ADIO� ROMIO is an I/O implementation of MPI-IO
� ADIO is a portable parallel I/O API that allows file systems to implement MPI-IO semantics
� NFSv3 ADIO driver– Forcing NFSv3 to comply with MPI-IO semantics hurts performance– Performs multiple close/open or lock/locku to revalidate file
• Inconsistent NFSv3 implementations• Protocol and implementation problems with the NFSv3 lockd daemon.• Disallows attribute caching
� UFS ADIO driver– POSIX compliant file systems
© 2009 IBM Corporation19
PDSW’09
POP-IO NFSv4/Ext3 Read and Write
0
10
20
30
40
50
60
70
80
90
100
2 4 16 32
Number of processes
I/O B
and
wid
th (
MB
/sec
)
Read UFS DriverRead NFS DriverWrite UFS DriverWrite NFS Driver
© 2009 IBM Corporation20
PDSW’09
POP-IO pNFS/GPFS Read and Write
0
10
20
30
40
50
60
70
80
90
100
2 4 16 32
Number of processes
Ag
gre
gat
e I/O
Ban
dw
idth
(M
B/s
ec)
Read UFS Driver
Read NFS driver
0
10
20
30
40
50
60
70
80
90
100
2 4 16 32Number of processes
Agg
rega
te I/
O B
andw
idth
(M
B/s
ec)
Write UFS DriverWrite NFS Driver
READ WRITE
© 2009 IBM Corporation21
PDSW’09
Agenda
Motivation
pNFS
HPC consistency requirements
Protocol consistency semantics
NFSv3 ADIO driver
pNFS ADIO driver
© 2009 IBM Corporation22
PDSW’09
Looking forward: pNFS ADIO driver possibilities
� Some I/O workloads, e.g., checkpointing, require minimal data coherence– Try to optimize “all read” or “all write” workloads
� For applications that perform sync-barrier-sync:o HEC POSIX extensions for HPC
o O_LAZY, lazyio_propagate(), lazyio_synchronize()o Direct I/O
o No read or writeback cacheo Possibly only on read path
o Non-portable techniques:o Manually invalidate entire page cacheo Fadvise
o open/close and/or lock/ulocko Data must be written to disk
o User-space client with customized interfaceo Support?
o Others?
© 2009 IBM Corporation23
PDSW’09
Summary
Good: MPI-IO and pNFS share similar relaxed semantics
Bad: HPC apps require on-demand file sync and revalidation (Lazy I/O)
Ugly: Interface to file system through POSIX interface
Lacks on-demand file revalidation
Need to investigate possible workarounds
© 2009 IBM Corporation24
PDSW’09
Thank You!