IBM Research Lab in Haifa

Architectural and Design Issues in the General Parallel File System

Benny Mandler - mandler@il.ibm.com
May 12, 2002
Agenda

What is GPFS?
- a file system for deep computing
GPFS uses
General architecture
How does GPFS meet its challenges - architectural issues
- performance
- scalability
- high availability
- concurrency control
Scalable Parallel Computing (What is GPFS?)

RS/6000 SP Scalable Parallel Computer
- 1-512 nodes connected by a high-speed switch
- 1-16 CPUs per node (Power2 or PowerPC)
- >1 TB disk per node
- 500 MB/s full duplex per switch port

Scalable parallel computing enables I/O-intensive applications:
- Deep computing - simulation, seismic analysis, data mining
- Server consolidation - aggregating file and web servers onto a centrally-managed machine
- Streaming video and audio for multimedia presentation
- Scalable object store for large digital libraries, web servers, databases, ...
GPFS addresses SP I/O requirements (What is GPFS?)

High performance - multiple GB/s to/from a single file
- concurrent reads and writes, parallel data access - within a file and across files
- fully parallel access to both file data and metadata
- client caching, enabled by distributed locking
- wide striping, large data blocks, prefetch

Scalability
- scales up to 512 nodes (N-way SMP): storage nodes, file system nodes, disks, adapters...

High availability
- fault tolerance via logging, replication, and RAID support
- survives node and disk failures

Uniform access via shared disks - a single-image file system

High capacity - multiple TB per file system, 100s of GB per file

Standards compliant (X/Open 4.0 "POSIX") with minor exceptions
GPFS vs. local and distributed file systems on the SP2

Native AIX File System (JFS)
- No file sharing - an application can only access files on its own node
- Applications must do their own data partitioning

DCE Distributed File System (successor to AFS)
- Application nodes (DCE clients) share files on the server node
- The switch is used as a fast LAN
- Coarse-grained (file or segment level) parallelism
- The server node is a performance and capacity bottleneck

GPFS Parallel File System
- GPFS file systems are striped across multiple disks on multiple storage nodes
- Independent GPFS instances run on each application node
- GPFS instances use storage nodes as "block servers" - all instances can access all disks
Tokyo Video on Demand Trial (GPFS uses)

Video on Demand for a new "borough" of Tokyo
- Applications: movies, news, karaoke, education ...
- Video distribution via hybrid fiber/coax
- Trial "live" since June '96; currently 500 subscribers
- 6 Mbit/sec MPEG video streams
- 100 simultaneous viewers (75 MB/sec)
- 200 hours of video on line (700 GB)
- 12-node SP-2 (7 distribution nodes, 5 storage nodes)
Engineering Design (GPFS uses)

Major aircraft manufacturer
- Uses CATIA for large designs and Elfini for structural modeling and analysis
- SP used for modeling/analysis
- GPFS stores the CATIA designs and structural modeling data
- GPFS allows all nodes to share designs and models
Shared Disks - Virtual Shared Disk architecture (General architecture)

File systems consist of one or more shared disks
- An individual disk can contain data, metadata, or both
- Each disk is assigned to a failure group
- Data and metadata are striped to balance load and maximize parallelism

Recoverable Virtual Shared Disk (VSD) for accessing disk storage
- Disks are physically attached to SP nodes
- VSD allows clients to access disks over the SP switch
- The VSD client looks like a disk device driver on the client node
- The VSD server executes I/O requests on the storage node
- VSD supports JBOD or RAID volumes, fencing, and multi-pathing (where the physical hardware permits)

GPFS assumes only a conventional block I/O interface (a sketch follows below)
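Because the only contract GPFS places on the storage layer is this conventional block I/O interface, it is easy to model. Below is a minimal Python sketch under assumed names (BlockDevice, read_block, and write_block are illustrative, not the actual VSD API); an in-memory list of buffers stands in for a disk reached over the SP switch:

```python
# Minimal sketch of the block I/O contract GPFS assumes (names are
# illustrative, not the real VSD API). The real VSD client forwards these
# requests over the SP switch to a VSD server on a storage node; here an
# in-memory list of buffers stands in for the physical disk.

BLOCK_SIZE = 256 * 1024  # matches the GPFS default block size (256 KB)

class BlockDevice:
    """Looks like a disk device driver: numbered blocks, read/write only."""

    def __init__(self, num_blocks: int):
        self._blocks = [bytes(BLOCK_SIZE)] * num_blocks

    def read_block(self, block_no: int) -> bytes:
        return self._blocks[block_no]

    def write_block(self, block_no: int, data: bytes) -> None:
        assert len(data) == BLOCK_SIZE, "whole-block transfers only"
        self._blocks[block_no] = data

disk = BlockDevice(num_blocks=1024)
disk.write_block(0, b"\x00" * BLOCK_SIZE)
```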
GPFS Architecture Overview (General architecture)

Implications of the shared-disk model
- All data and metadata live on globally accessible disks (VSD)
- All access to permanent data goes through the disk I/O interface
- Distributed protocols, e.g. distributed locking, coordinate disk access from multiple nodes
- Fine-grained locking allows parallel access by multiple clients
- Logging and shadowing restore consistency after node failures

Implications of large scale
- Supports up to 4096 disks of up to 1 TB each (4 PB); the largest system in production is 75 TB
- Failure detection and recovery protocols handle node failures
- Replication and/or RAID protect against disk and storage node failures
- On-line dynamic reconfiguration (add, delete, replace disks and nodes; rebalance the file system)
GPFS Architecture - Node Roles (General architecture)

Three types of nodes: file system, storage, and manager
- Each node can perform any of these functions

File system nodes
- run user programs, read/write data to/from storage nodes
- implement the virtual file system interface
- cooperate with manager nodes to perform metadata operations

Manager nodes (one per file system)
- global lock manager
- recovery manager
- global allocation manager
- quota manager
- file metadata manager
- admin services fail over

Storage nodes
- implement the block I/O interface
- shared access from file system and manager nodes
- interact with manager nodes for recovery (e.g. fencing)
- file data and metadata are striped across multiple disks on multiple storage nodes
GPFS Software Structure (General architecture)
Disk Data Structures: Files (General architecture)

- Large block size allows efficient use of disk bandwidth
- Fragments reduce space overhead for small files
- No designated "mirror", no fixed placement function:
  - flexible replication (e.g., replicate only metadata, or only important files)
  - dynamic reconfiguration: data can migrate block by block
- Multi-level indirect blocks
- Each disk address is a list of pointers to replicas; each pointer is a disk id + sector number (see the sketch below)
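A minimal sketch of the disk-address structure just described: each address is a list of pointers to replicas, and each pointer is a disk id plus a sector number (the field names here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DiskPointer:
    disk_id: int    # which shared disk holds this replica
    sector_no: int  # sector offset on that disk

# A disk address is just a list of replica pointers. With no designated
# mirror and no fixed placement function, the replicas of one block can
# live on any disks, which is what lets data migrate block by block.
disk_address = [
    DiskPointer(disk_id=3, sector_no=81920),
    DiskPointer(disk_id=7, sector_no=12288),
]
```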
Large File Block Size (Performance)

Conventional file systems store data in small blocks to pack data more densely.
GPFS uses large blocks (256 KB default) to optimize disk transfer speed.

[Chart: throughput (MB/sec, 0-7) rises with I/O transfer size (0-1024 KB)]
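The shape of the chart follows from simple arithmetic: every I/O pays a roughly fixed positioning cost (seek plus rotational latency) that a larger transfer amortizes. A sketch with assumed numbers (the constants below are guesses chosen to resemble disks of that era, not measurements from the slide):

```python
# Effective throughput vs. transfer size: a fixed per-I/O positioning cost
# is amortized over larger blocks. T_POS and MEDIA_RATE are assumptions,
# not figures from the presentation.

T_POS = 0.010       # seconds of seek + rotational latency per I/O (assumed)
MEDIA_RATE = 8.0    # MB/s sustained media transfer rate (assumed)

def effective_throughput(block_kb: int) -> float:
    block_mb = block_kb / 1024
    return block_mb / (T_POS + block_mb / MEDIA_RATE)

for kb in (4, 64, 256, 1024):
    print(f"{kb:5d} KB -> {effective_throughput(kb):4.1f} MB/s")
# 4 KB transfers reach well under 1 MB/s; 256 KB transfers already get
# most of the way to the media rate, which is why a large default pays off.
```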
Parallelism and consistency

- Distributed locking - acquire an appropriate lock for every operation; used for updates to user data
- Centralized management - conflicting operations are forwarded to a designated node; used for file metadata
- Distributed locking + centralized hints - used for space allocation
- Central coordinator - used for configuration changes
Contention shows up as I/O slowdown effects and additional I/O activity rather than token server overload.
Parallel File Access From Multiple Nodes (Performance)

- GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a single file with no conflict
- Global locking serializes access to overlapping ranges of a file
- Global locking is based on "tokens", which convey access rights to an object (e.g. a file) or a subset of an object (e.g. a byte range); see the sketch below
- Tokens can be held across file system operations, enabling coherent data caching in clients
- Cached data is discarded or written to disk when a token is revoked
- Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file size operations
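A toy sketch of the byte-range token idea (illustrative only; the actual GPFS token protocol, with its required/desired ranges and revocation callbacks, is considerably richer). Two writers on disjoint ranges coexist, while an overlapping request forces a revoke, which is the moment the holder flushes or discards its cached data:

```python
# Toy byte-range token manager. A token is granted if the requested range
# does not conflict with one held by another node; conflicting tokens are
# revoked first, which in GPFS is when the holder flushes or discards cache.

class TokenManager:
    def __init__(self):
        self.tokens = []  # tuples: (node, start, end, mode); mode 'r' or 'w'

    def _conflicts(self, node, start, end, mode):
        return [t for t in self.tokens
                if t[0] != node                      # another node
                and t[1] < end and start < t[2]      # ranges overlap
                and (mode == 'w' or t[3] == 'w')]    # at least one writer

    def acquire(self, node, start, end, mode):
        for victim in self._conflicts(node, start, end, mode):
            self.tokens.remove(victim)  # revoke: victim writes back/discards
        self.tokens.append((node, start, end, mode))

tm = TokenManager()
tm.acquire("node1", 0, 1 << 20, 'w')        # node1 writes the first 1 MB
tm.acquire("node2", 1 << 20, 2 << 20, 'w')  # disjoint range: no conflict
tm.acquire("node2", 0, 4096, 'r')           # overlap: node1's token revoked
```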
Deep Prefetch for High Throughput (Performance)

- GPFS stripes successive blocks across successive disks
- Disk I/O for sequential reads and writes is done in parallel
- GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal degree of parallelism
- Prefetch algorithms now recognize strided and reverse sequential access
- Accepts hints
- Write-behind policy

Example: an application reading at 15 MB/sec from disks that deliver 5 MB/sec each needs three I/Os executed in parallel.
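The worked example reduces to one line of arithmetic: the prefetch depth is the ceiling of the application's consumption rate over a single disk's rate, and striping guarantees the parallel reads land on different disks. A sketch:

```python
import math

def prefetch_depth(app_rate_mb_s: float, disk_rate_mb_s: float) -> int:
    """Parallel block reads needed to keep the application fed when each
    individual disk is slower than the application's consumption rate."""
    return math.ceil(app_rate_mb_s / disk_rate_mb_s)

# The slide's example: an application reading at 15 MB/s from 5 MB/s disks
# needs 3 I/Os in flight; because successive blocks are striped across
# successive disks, those 3 reads hit 3 different disks.
assert prefetch_depth(15, 5) == 3
```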
GPFS Throughput Scaling for Non-cached Files (Scalability)

Hardware: Power2 wide nodes, SSA disks

Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes

Result: throughput increases nearly linearly with the number of storage nodes

Bottlenecks:
- the Microchannel bus limits node throughput to 50 MB/s
- system throughput is limited by the available storage nodes
Disk Data Structures: Allocation Map (Scalability)

Segmented block allocation map:
- Each segment contains bits representing blocks on all disks
- Each segment is a separately lockable unit
- Minimizes contention for the allocation map when writing files on multiple nodes
- The allocation manager service provides hints on which segments to try (see the sketch below)
- Similar: the inode allocation map
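A minimal sketch of a segmented allocation map (a toy model; the real map tracks per-disk bitmaps and striping constraints). Each segment has its own lock, so nodes writing different files allocate from different segments without contending, and the allocation manager's hint is modeled here as a starting segment index:

```python
import threading

class SegmentedAllocationMap:
    """Toy segmented allocation map: each segment covers a slice of the
    blocks and is a separately lockable unit, so writers on different
    nodes can allocate concurrently from different segments."""

    def __init__(self, num_segments: int, blocks_per_segment: int):
        self.segments = [
            {"lock": threading.Lock(),
             "free": set(range(s * blocks_per_segment,
                               (s + 1) * blocks_per_segment))}
            for s in range(num_segments)
        ]

    def allocate(self, hint_segment: int = 0) -> int:
        n = len(self.segments)
        for i in range(n):  # start at the segment the manager hinted at
            seg = self.segments[(hint_segment + i) % n]
            with seg["lock"]:
                if seg["free"]:
                    return seg["free"].pop()
        raise RuntimeError("no free blocks")

amap = SegmentedAllocationMap(num_segments=4, blocks_per_segment=1024)
block_no = amap.allocate(hint_segment=2)  # hint from the allocation manager
```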
High Availability - Logging and Recovery

Problem: detect and fix file system inconsistencies after a failure of one or more nodes
- All updates that could leave inconsistencies if uncompleted are logged
- Write-ahead logging policy: the log record is forced to disk before the dirty metadata is written (sketched below)
- Redo log: replaying all log records at recovery time restores file system consistency

Logged updates:
- I/O to replicated data
- directory operations (create, delete, move, ...)
- allocation map changes

Other techniques:
- ordered writes
- shadowing
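A minimal sketch of the write-ahead redo-logging discipline (toy in-memory stand-ins for the log and the metadata disk; real GPFS logs records for the update types listed above):

```python
# Write-ahead logging sketch: the log record reaches stable storage before
# the dirty metadata does, so replaying the redo log after a crash redoes
# any update whose metadata write never completed.

log_disk = []        # stand-in for the on-disk redo log
metadata_disk = {}   # stand-in for on-disk metadata

def update_metadata(key, value):
    log_disk.append((key, value))   # 1. force the log record to disk first
    metadata_disk[key] = value      # 2. only then write the dirty metadata
    # A crash between steps 1 and 2 leaves a log record but stale metadata.

def recover():
    for key, value in log_disk:     # redo: replaying the log restores
        metadata_disk[key] = value  #       file system consistency

update_metadata("dir-entry:/a/b", "created")
metadata_disk.clear()               # simulate losing the metadata write
recover()
assert metadata_disk["dir-entry:/a/b"] == "created"
```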
Node Failure Recovery (High Availability)

Application node failure:
- a force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost
- all potential inconsistencies are protected by a token and are logged
- the file system manager runs log recovery on behalf of the failed node
- actions taken: restore metadata being updated by the failed node to a consistent state, release resources held by the failed node
- after successful log recovery, tokens held by the failed node are released

File system manager failure:
- a new node is appointed to take over
- the new file system manager restores volatile state by querying other nodes
- the new file system manager may have to undo or finish a partially completed configuration change (e.g., adding or deleting a disk)

Storage node failure:
- dual-attached disk: use the alternate path (VSD)
- single-attached disk: treat as a disk failure
Handling Disk Failures

When a disk failure is detected:
- the node that detects the failure informs the file system manager
- the file system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)

While a disk is down:
- read one / write all available copies (see the sketch below)
- a "missing update" bit is set in the inode of each modified file

When/if the disk recovers:
- the file system manager searches the inode file for missing-update bits
- all data and metadata of files with missing updates are copied back to the recovering disk (one file at a time, under the normal locking protocol)
- until missing-update recovery is complete, data on the recovering disk is treated as write-only

Unrecoverable disk failure:
- the failed disk is deleted from the configuration or replaced by a new one
- new replicas are created on the replacement disk or on other disks
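A toy sketch of the "read one / write all available copies" rule together with the missing-update bit (class and field names are illustrative):

```python
class ReplicatedBlock:
    """Toy replicated block: writes go to every replica whose disk is up;
    if any replica is skipped, a missing-update flag is recorded (GPFS
    keeps this bit in the file's inode) so the stale replica is re-synced
    when its disk comes back."""

    def __init__(self, replica_disks):
        self.data = {d: None for d in replica_disks}
        self.disk_up = {d: True for d in replica_disks}
        self.missing_update = False

    def write(self, buf):
        for disk in self.data:
            if self.disk_up[disk]:
                self.data[disk] = buf       # write all available copies
            else:
                self.missing_update = True  # remember the skipped replica

    def read(self):
        for disk, buf in self.data.items():
            if self.disk_up[disk]:
                return buf                  # read any one available copy
        raise IOError("no available replica")

blk = ReplicatedBlock(["disk3", "disk7"])
blk.disk_up["disk7"] = False   # disk7 is marked "down"
blk.write(b"new contents")
assert blk.missing_update      # disk7 must be re-synced before it is read
```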
Cache Management

[Diagram: the total cache is split into a general pool (clock list, merge, re-map) and several per-block-size pools (each a clock list, sketched below), with per-pool statistics: sequential/random, optimal, total]

- Balance pools dynamically according to usage patterns
- Avoid fragmentation - internal and external
- Unified steal
- Periodic re-balancing
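A minimal sketch of one per-block-size pool's clock list, i.e. second-chance replacement (the dynamic re-balancing between pools and the unified steal are omitted):

```python
class ClockPool:
    """Second-chance ("clock") replacement for one block-size pool."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.frames = []  # each frame is [block_id, referenced_bit]
        self.hand = 0

    def access(self, block_id):
        for frame in self.frames:
            if frame[0] == block_id:
                frame[1] = 1                  # hit: set the reference bit
                return
        if len(self.frames) < self.capacity:  # miss with room to spare
            self.frames.append([block_id, 1])
            return
        while True:                           # miss: sweep the clock hand
            frame = self.frames[self.hand]
            self.hand = (self.hand + 1) % self.capacity
            if frame[1]:
                frame[1] = 0                  # give a second chance
            else:
                frame[0], frame[1] = block_id, 1  # steal this frame
                return

pool = ClockPool(capacity=2)
for b in ("a", "b", "c"):  # the sweep clears both bits, then "a" is stolen
    pool.access(b)
```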
Epilogue

- Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems
- IP rich - ~20 filed patents
- State of the art: TeraSort world record of 17 minutes, using a 488-node SP (432 file system nodes and 56 storage nodes, 604e 332 MHz) with 6 TB of total disk space

Recommended references:
- GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html
- FAST 2002: http://www.usenix.org/events/fast/schmuck.html
- TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html
- Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html