General Parallel File System 13



IBM Research Lab in Haifa

Architectural and Design Issues in the General Parallel File System

Benny Mandler - [email protected]

May 12, 2002


Agenda

- What is GPFS?
  - a file system for deep computing
- GPFS uses
- General architecture
- How does GPFS meet its challenges - architectural issues
  - performance
  - scalability
  - high availability
  - concurrency control


Scalable Parallel Computing (What is GPFS?)

RS/6000 SP Scalable Parallel Computer
- 1-512 nodes connected by a high-speed switch
- 1-16 CPUs per node (Power2 or PowerPC)
- >1 TB disk per node
- 500 MB/s full duplex per switch port

Scalable parallel computing enables I/O-intensive applications:
- Deep computing - simulation, seismic analysis, data mining
- Server consolidation - aggregating file and web servers onto a centrally-managed machine
- Streaming video and audio for multimedia presentation
- Scalable object store for large digital libraries, web servers, databases, ...


GPFS addresses SP I/O requirements (What is GPFS?)

High performance - multiple GB/s to/from a single file
- concurrent reads and writes, parallel data access - within a file and across files
- fully parallel access to both file data and metadata
- client caching enabled by distributed locking
- wide striping, large data blocks, prefetch

Scalability
- scales up to 512 nodes (N-way SMP): storage nodes, file system nodes, disks, adapters, ...

High availability
- fault tolerance via logging, replication, and RAID support
- survives node and disk failures

Uniform access via shared disks - single-image file system

High capacity - multiple TB per file system, 100s of GB per file

Standards compliant (X/Open 4.0 "POSIX") with minor exceptions


GPFS vs. local and distributed file systems on the SP2

Native AIX File System (JFS)
- No file sharing - an application can only access files on its own node
- Applications must do their own data partitioning

DCE Distributed File System (a follow-on to AFS)
- Application nodes (DCE clients) share files on a server node
- The switch is used as a fast LAN
- Coarse-grained (file- or segment-level) parallelism
- The server node is a performance and capacity bottleneck

GPFS Parallel File System
- GPFS file systems are striped across multiple disks on multiple storage nodes
- Independent GPFS instances run on each application node
- GPFS instances use storage nodes as "block servers" - all instances can access all disks


Tokyo Video on Demand Trial

Video on Demand for a new "borough" of Tokyo
- Applications: movies, news, karaoke, education, ...
- Video distribution via hybrid fiber/coax
- Trial "live" since June '96
- Currently 500 subscribers
- 6 Mbit/sec MPEG video streams
- 100 simultaneous viewers (75 MB/sec)
- 200 hours of video on line (700 GB)
- 12-node SP-2 (7 distribution, 5 storage)


Engineering Design (GPFS uses)

Major aircraft manufacturer
- Uses CATIA for large designs, Elfini for structural modeling and analysis
- SP used for modeling/analysis
- Uses GPFS to store CATIA designs and structural modeling data
- GPFS allows all nodes to share designs and models


Shared Disks - Virtual Shared Disk architecture (General architecture)

File systems consist of one or more shared disks
- An individual disk can contain data, metadata, or both
- Disks are assigned to failure groups
- Data and metadata are striped to balance load and maximize parallelism

Recoverable Virtual Shared Disk (VSD) for accessing disk storage
- Disks are physically attached to SP nodes
- VSD allows clients to access disks over the SP switch
- The VSD client looks like a disk device driver on the client node
- The VSD server executes I/O requests on the storage node
- VSD supports JBOD or RAID volumes, fencing, and multi-pathing (where the physical hardware permits)

GPFS assumes only a conventional block I/O interface; a minimal sketch of such an interface follows below.
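The following is a minimal, hypothetical C sketch of what "a conventional block I/O interface" could look like. The names (bdev_read, bdev_write, bdev_fence) and fields are assumptions for illustration, not the actual VSD API; the point is that the file system layer needs little more than sector reads, sector writes, and a fencing primitive.

    /* Hypothetical block-device interface, for illustration only. */
    #include <stdint.h>

    typedef struct block_dev {
        uint32_t disk_id;        /* identifies one shared disk in the file system */
        uint32_t sector_size;    /* bytes per sector, e.g. 512                     */
        uint64_t num_sectors;    /* capacity of the disk                           */
    } block_dev_t;

    /* Read/write 'count' sectors starting at 'sector'; return 0 on success. */
    int bdev_read (block_dev_t *d, uint64_t sector, uint32_t count, void *buf);
    int bdev_write(block_dev_t *d, uint64_t sector, uint32_t count, const void *buf);

    /* Fence a failed node: refuse further I/O from it until recovery completes
     * (used by the manager node during node failure recovery). */
    int bdev_fence(block_dev_t *d, uint32_t failed_node_id);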


GPFS Architecture Overview (General architecture)

Implications of the shared-disk model
- All data and metadata reside on globally accessible disks (VSD)
- All access to permanent data goes through the disk I/O interface
- Distributed protocols (e.g., distributed locking) coordinate disk access from multiple nodes
- Fine-grained locking allows parallel access by multiple clients
- Logging and shadowing restore consistency after node failures

Implications of large scale
- Supports up to 4096 disks of up to 1 TB each (4 petabytes); the largest system in production is 75 TB
- Failure detection and recovery protocols handle node failures
- Replication and/or RAID protect against disk and storage-node failures
- On-line dynamic reconfiguration (add, delete, or replace disks and nodes; rebalance the file system)


GPFS Architecture - Node Roles (General architecture)

Three types of nodes: file system, storage, and manager; each node can perform any of these functions.

File system nodes
- run user programs, read/write data to/from storage nodes
- implement the virtual file system interface
- cooperate with manager nodes to perform metadata operations

Manager nodes (one per file system)
- global lock manager
- recovery manager
- global allocation manager
- quota manager
- file metadata manager
- admin services fail over

Storage nodes
- implement the block I/O interface
- shared access from file system and manager nodes
- interact with manager nodes for recovery (e.g., fencing)
- file data and metadata are striped across multiple disks on multiple storage nodes


GPFS Software Structure (General architecture)


Disk Data Structures: Files (General architecture)

- Large block size allows efficient use of disk bandwidth
- Fragments reduce space overhead for small files
- No designated "mirror", no fixed placement function:
  - flexible replication (e.g., replicate only metadata, or only important files)
  - dynamic reconfiguration: data can migrate block by block
- Multi-level indirect blocks
  - each disk address is a list of pointers to replicas
  - each pointer is a disk id + sector number

An illustrative layout of these structures appears below.
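Below is an illustrative C layout for the structures just described (replicated disk addresses and multi-level indirect blocks). Field names, the replication factor, and the direct-slot count are assumptions for the sketch, not the real GPFS on-disk format.

    #include <stdint.h>

    #define MAX_REPLICAS 2              /* assumed replication factor             */

    /* One pointer: which disk, and where on it. */
    typedef struct disk_ptr {
        uint32_t disk_id;               /* index into the file system's disk list */
        uint64_t sector;                /* starting sector of the block           */
    } disk_ptr_t;

    /* One logical disk address: a list of pointers to the block's replicas.
     * With no fixed placement function, each replica can live anywhere. */
    typedef struct disk_addr {
        disk_ptr_t replica[MAX_REPLICAS];
        uint8_t    nreplicas;           /* 1 if unreplicated, 2 if mirrored       */
    } disk_addr_t;

    /* Inode with multi-level indirect blocks: small files fit in the direct
     * slots, larger files go through indirect blocks whose entries are
     * themselves disk_addr_t values. */
    #define NDIRECT 10                  /* assumed count, for illustration        */

    typedef struct inode {
        uint64_t    size;
        uint32_t    block_size;         /* e.g. 256 KB                            */
        disk_addr_t direct[NDIRECT];
        disk_addr_t single_indirect;    /* block of disk_addr_t entries           */
        disk_addr_t double_indirect;    /* block of blocks of disk_addr_t entries */
    } inode_t;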


Large File Block Size (Performance)

- Conventional file systems store data in small blocks to pack data more densely
- GPFS uses large blocks (256 KB default) to optimize disk transfer speed (a rough calculation below shows why)

[Chart: Throughput (MB/sec, 0-7) vs. I/O Transfer Size (0-1024 Kbytes).]
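A back-of-the-envelope model shows why large blocks raise throughput: each I/O pays a fixed positioning cost before transferring at the media rate. The 10 ms overhead and 10 MB/s media rate below are assumed round numbers for disks of that era, not measured GPFS figures.

    #include <stdio.h>

    int main(void) {
        const double overhead_s  = 0.010;       /* assumed seek + rotate: 10 ms  */
        const double media_MBps  = 10.0;        /* assumed sustained media rate  */
        const double blocks_KB[] = { 4, 64, 256, 1024 };

        for (int i = 0; i < 4; i++) {
            double mb      = blocks_KB[i] / 1024.0;
            double total_s = overhead_s + mb / media_MBps;
            printf("%6.0f KB block -> %5.2f MB/s effective\n",
                   blocks_KB[i], mb / total_s);
        }
        return 0;   /* e.g. 4 KB -> ~0.4 MB/s, 256 KB -> ~7 MB/s */
    }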


Parallelism and consistency

- Distributed locking - acquire the appropriate lock for every operation; used for updates to user data
- Centralized management - conflicting operations are forwarded to a designated node; used for file metadata
- Distributed locking + centralized hints - used for space allocation
- Central coordinator - used for configuration changes

Under contention the cost shows up as I/O slowdown effects (additional I/O activity) rather than as token-server overload.


Parallel File Access From Multiple Nodes (Performance)

- GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a single file with no conflict
- Global locking serializes access to overlapping ranges of a file
- Global locking is based on "tokens", which convey access rights to an object (e.g., a file) or a subset of an object (e.g., a byte range)
- Tokens can be held across file system operations, enabling coherent data caching in clients
- Cached data is discarded or written to disk when a token is revoked
- Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file-size operations

A client-side sketch of the byte-range token idea follows below.
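Here is a client-side sketch of how byte-range tokens could be used, assuming a hypothetical token API (br_token_acquire, a revoke callback) and assumed cache helpers; it illustrates the idea on the slide, not the actual GPFS code.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct br_token {
        uint64_t start, end;     /* byte range this node currently holds         */
        bool     exclusive;      /* write (exclusive) vs. read (shared) mode     */
    } br_token_t;

    /* Assumed helpers standing in for the client cache machinery. */
    extern void flush_dirty_range(const char *file, uint64_t start, uint64_t end);
    extern void invalidate_range (const char *file, uint64_t start, uint64_t end);
    extern bool holds_covering_token(const char *file, uint64_t start, uint64_t end,
                                     bool exclusive);
    extern int  write_to_cache(const char *file, uint64_t off, const void *buf,
                               uint64_t len);

    /* Ask the token manager for rights on [required_start, required_end);
     * it may grant a larger "desired" range so future I/O needs no message. */
    extern int br_token_acquire(const char *file, uint64_t required_start,
                                uint64_t required_end, bool exclusive,
                                br_token_t *out);

    /* Called when another node needs a conflicting range: flush or discard
     * cached data covered by the token, then give the token back. */
    void br_token_revoke_cb(const char *file, const br_token_t *tok)
    {
        if (tok->exclusive)
            flush_dirty_range(file, tok->start, tok->end);  /* write-back       */
        else
            invalidate_range(file, tok->start, tok->end);   /* drop cached data */
        /* the token is released to the manager after this callback returns */
    }

    /* A write only needs a token message if the node does not already hold a
     * covering exclusive token -- so non-overlapping writers never conflict. */
    int cached_write(const char *file, uint64_t off, const void *buf, uint64_t len)
    {
        br_token_t tok;
        if (!holds_covering_token(file, off, off + len, /*exclusive=*/true)) {
            int rc = br_token_acquire(file, off, off + len, true, &tok);
            if (rc != 0)
                return rc;
        }
        return write_to_cache(file, off, buf, len);         /* write-behind later */
    }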


Deep Prefetch for High Throughput (Performance)

- GPFS stripes successive blocks across successive disks
- Disk I/O for sequential reads and writes is done in parallel
- GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal degree of parallelism
- Prefetch algorithms now recognize strided and reverse-sequential access
- Accepts hints
- Write-behind policy

Example from the slide's figure: the application reads at 15 MB/sec while each disk reads at 5 MB/sec, so three I/Os are executed in parallel.

A sketch of the block-to-disk striping follows below.
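A small sketch of round-robin striping, mapping a logical block number to a (disk, sector) pair; the modulo placement, disk count, and sector size are assumptions for illustration.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_DISKS       8
    #define BLOCK_SIZE      (256 * 1024)               /* 256 KB default          */
    #define SECTOR_SIZE     512
    #define SECTORS_PER_BLK (BLOCK_SIZE / SECTOR_SIZE)

    typedef struct location { uint32_t disk; uint64_t sector; } location_t;

    static location_t block_to_disk(uint64_t logical_block)
    {
        location_t loc;
        loc.disk   = (uint32_t)(logical_block % NUM_DISKS);         /* next disk */
        loc.sector = (logical_block / NUM_DISKS) * SECTORS_PER_BLK; /* next slot */
        return loc;
    }

    int main(void)
    {
        /* A sequential read of blocks 0..7 touches all eight disks once,
         * so the eight I/Os can be issued in parallel (prefetch depth 8). */
        for (uint64_t b = 0; b < 8; b++) {
            location_t loc = block_to_disk(b);
            printf("block %2llu -> disk %u, sector %llu\n",
                   (unsigned long long)b, loc.disk,
                   (unsigned long long)loc.sector);
        }
        return 0;
    }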


GPFS Throughput Scaling for Non-cached Files (Scalability)

- Hardware: Power2 wide nodes, SSA disks
- Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes
- Result: throughput increases nearly linearly with the number of storage nodes
- Bottlenecks:
  - the Micro Channel bus limits node throughput to 50 MB/s
  - system throughput is limited by the available storage nodes


Disk Data Structures: Allocation map (Scalability)

Segmented block allocation map:
- each segment contains bits representing blocks on all disks
- each segment is a separately lockable unit
- minimizes contention for the allocation map when writing files on multiple nodes
- the allocation manager service provides hints on which segments to try
- the inode allocation map is handled similarly

A sketch of the segmented map follows below.
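The segmented map can be sketched as follows; the segment and bitmap sizes, the pthread mutex standing in for a distributed lock, and the alloc_manager_hint call are assumptions for illustration only.

    #include <stdint.h>
    #include <pthread.h>

    #define NUM_DISKS        8
    #define BLOCKS_PER_SEG   1024      /* blocks described per disk per segment   */
    #define NUM_SEGMENTS     512

    /* One segment covers a slice of every disk, so a node working in "its"
     * segment can stripe a file across all disks without touching the
     * segments other nodes are using. */
    typedef struct alloc_segment {
        pthread_mutex_t lock;                           /* separately lockable    */
        uint8_t bitmap[NUM_DISKS][BLOCKS_PER_SEG / 8];  /* 1 bit per block        */
    } alloc_segment_t;

    static alloc_segment_t map[NUM_SEGMENTS];

    void alloc_map_init(void)                  /* call once before allocation */
    {
        for (uint32_t i = 0; i < NUM_SEGMENTS; i++)
            pthread_mutex_init(&map[i].lock, NULL);
    }

    /* Hypothetical hint from the allocation manager: a segment believed to
     * have free space and not currently in use by another node. */
    extern uint32_t alloc_manager_hint(void);

    /* Allocate one block on 'disk'; returns the block number, or -1 if the
     * hinted segment is full (the caller would then ask for another hint). */
    int64_t alloc_block(uint32_t disk)
    {
        uint32_t seg = alloc_manager_hint() % NUM_SEGMENTS;
        alloc_segment_t *s = &map[seg];

        pthread_mutex_lock(&s->lock);
        for (uint32_t b = 0; b < BLOCKS_PER_SEG; b++) {
            if (!(s->bitmap[disk][b / 8] & (1u << (b % 8)))) {
                s->bitmap[disk][b / 8] |= (1u << (b % 8));   /* mark allocated */
                pthread_mutex_unlock(&s->lock);
                return (int64_t)seg * BLOCKS_PER_SEG + b;    /* block no. on disk */
            }
        }
        pthread_mutex_unlock(&s->lock);
        return -1;
    }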


High Availability - Logging and Recovery (High Availability)

Problem: detect and fix file system inconsistencies after a failure of one or more nodes
- all updates that could leave inconsistencies if left uncompleted are logged
- write-ahead logging policy: the log record is forced to disk before the dirty metadata is written
- redo log: replaying all log records at recovery time restores file system consistency

Logged updates:
- I/O to replicated data
- directory operations (create, delete, move, ...)
- allocation map changes

Other techniques:
- ordered writes
- shadowing

A sketch of the write-ahead ordering follows below.
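The write-ahead rule itself is simple to sketch with POSIX file I/O. The record format and field names below are assumptions; only the ordering (force the log record, then write the metadata) reflects the slide.

    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    typedef struct log_record {
        uint64_t lsn;              /* log sequence number                        */
        uint32_t type;             /* e.g. directory insert, allocation change   */
        uint64_t disk_id, sector;  /* which metadata block the update touches    */
        uint8_t  after_image[512]; /* redo information: the new contents         */
    } log_record_t;

    int logged_metadata_update(int log_fd, int disk_fd,
                               const log_record_t *rec, off_t meta_offset)
    {
        /* 1. Append the redo record and force it to disk (write-ahead rule). */
        if (write(log_fd, rec, sizeof *rec) != (ssize_t)sizeof *rec) return -1;
        if (fsync(log_fd) != 0)                                      return -1;

        /* 2. Only now may the dirty metadata block itself be written.  If the
         *    node dies between steps 1 and 2, replaying the redo log during
         *    recovery reapplies the after-image and restores consistency.    */
        if (pwrite(disk_fd, rec->after_image, sizeof rec->after_image,
                   meta_offset) != (ssize_t)sizeof rec->after_image)
            return -1;
        return 0;
    }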


Node Failure Recovery (High Availability)

Application node failure:
- the force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost
- all potential inconsistencies are protected by a token and are logged
- the file system manager runs log recovery on behalf of the failed node
- actions taken: restore metadata that was being updated by the failed node to a consistent state, and release resources held by the failed node
- after successful log recovery, tokens held by the failed node are released

File system manager failure:
- a new node is appointed to take over
- the new file system manager restores volatile state by querying the other nodes
- the new file system manager may have to undo or finish a partially completed configuration change (e.g., add/delete disk)

Storage node failure:
- dual-attached disk: use the alternate path (VSD)
- single-attached disk: treat as a disk failure


Handling Disk Failures

When a disk failure is detected:
- the node that detects the failure informs the file system manager
- the file system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)

While a disk is down:
- read one / write all available copies
- a "missing update" bit is set in the inode of modified files

When/if the disk recovers:
- the file system manager searches the inode file for missing-update bits
- all data and metadata of files with missing updates are copied back to the recovering disk (one file at a time, under the normal locking protocol)
- until missing-update recovery is complete, data on the recovering disk is treated as write-only

Unrecoverable disk failure:
- the failed disk is deleted from the configuration or replaced by a new one
- new replicas are created on the replacement or on other disks

A sketch of the read-one / write-all-available rule follows below.
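A sketch of the read-one / write-all-available rule, reusing the replica-pointer idea from the file data structures slide; the helper functions and the missing-update hook are assumed names, not GPFS internals.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_REPLICAS 2

    typedef struct replica { uint32_t disk_id; uint64_t sector; } replica_t;

    extern bool disk_is_up(uint32_t disk_id);
    extern int  disk_read (uint32_t disk_id, uint64_t sector, void *buf, size_t len);
    extern int  disk_write(uint32_t disk_id, uint64_t sector, const void *buf, size_t len);
    extern void inode_set_missing_update_bit(uint64_t inode_no);

    /* Read: any single replica on an "up" disk will do. */
    int replicated_read(const replica_t r[MAX_REPLICAS], void *buf, size_t len)
    {
        for (int i = 0; i < MAX_REPLICAS; i++)
            if (disk_is_up(r[i].disk_id) &&
                disk_read(r[i].disk_id, r[i].sector, buf, len) == 0)
                return 0;
        return -1;                       /* no replica reachable */
    }

    /* Write: update every replica whose disk is up; if any replica was
     * skipped, remember that this file needs repair when the disk returns. */
    int replicated_write(uint64_t inode_no, const replica_t r[MAX_REPLICAS],
                         const void *buf, size_t len)
    {
        bool skipped = false;
        for (int i = 0; i < MAX_REPLICAS; i++) {
            if (!disk_is_up(r[i].disk_id)) { skipped = true; continue; }
            if (disk_write(r[i].disk_id, r[i].sector, buf, len) != 0)
                return -1;
        }
        if (skipped)
            inode_set_missing_update_bit(inode_no);   /* drives later recovery */
        return 0;
    }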


Cache Management

The total cache is split into a general pool (clock list; merge and re-map operations) and several block-size pools, each with its own clock list. Per-pool statistics track sequential vs. random accesses and optimal vs. total usage.

- Balance pools dynamically according to usage patterns
- Avoid fragmentation, both internal and external
- Unified steal policy
- Periodic re-balancing

A sketch of clock (second-chance) replacement within one pool follows below.
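A sketch of clock (second-chance) replacement within one block-size pool; the structure names and the write-back hook are assumptions, and the real buffer manager would also coordinate with tokens and write-behind.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct cache_buf {
        uint64_t block_id;
        bool     referenced;     /* set on every hit, cleared by the clock hand */
        bool     dirty;          /* must be written back before reuse           */
        uint8_t *data;
    } cache_buf_t;

    typedef struct bs_pool {      /* one pool per block size                    */
        cache_buf_t *bufs;
        size_t       nbufs;
        size_t       hand;        /* clock hand position                        */
    } bs_pool_t;

    extern void write_back(cache_buf_t *b);   /* assumed write-behind hook */

    /* Find a victim buffer: sweep the clock, giving each referenced buffer a
     * second chance before it is stolen. */
    cache_buf_t *pool_steal(bs_pool_t *p)
    {
        for (;;) {
            cache_buf_t *b = &p->bufs[p->hand];
            p->hand = (p->hand + 1) % p->nbufs;

            if (b->referenced) {           /* recently used: spare it this pass */
                b->referenced = false;
                continue;
            }
            if (b->dirty)                  /* flush before reuse                */
                write_back(b);
            return b;                      /* victim: caller re-maps it         */
        }
    }

    /* On a cache hit the buffer just gets its reference bit set. */
    void pool_hit(cache_buf_t *b) { b->referenced = true; }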


Epilogue

- Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems
- IP rich: ~20 filed patents
- State of the art: TeraSort
  - world record of 17 minutes
  - on a 488-node SP: 432 file system nodes and 56 storage nodes (604e, 332 MHz)
  - 6 TB of total disk space

References
- GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html
- FAST 2002: http://www.usenix.org/events/fast/schmuck.html
- TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html
- Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html