General Parallel File System 13



IBM Research Lab in Haifa

Architectural and Design Issues in the General Parallel File System

Benny Mandler - [email protected]

May 12, 2002


Agenda

- What is GPFS?
  - a file system for deep computing
- GPFS uses
- General architecture
- How does GPFS meet its challenges - architectural issues
  - performance
  - scalability
  - high availability
  - concurrency control


Scalable Parallel Computing (What is GPFS?)

RS/6000 SP Scalable Parallel Computer
- 1-512 nodes connected by a high-speed switch
- 1-16 CPUs per node (Power2 or PowerPC)
- >1 TB disk per node
- 500 MB/s full duplex per switch port

Scalable parallel computing enables I/O-intensive applications:
- Deep computing - simulation, seismic analysis, data mining
- Server consolidation - aggregating file and web servers onto a centrally-managed machine
- Streaming video and audio for multimedia presentation
- Scalable object store for large digital libraries, web servers, databases, ...


GPFS addresses SP I/O requirements (What is GPFS?)

High performance - multiple GB/s to/from a single file
- concurrent reads and writes, parallel data access - within a file and across files
- fully parallel access to both file data and metadata
- client caching enabled by distributed locking
- wide striping, large data blocks, prefetch

Scalability
- scales up to 512 nodes (N-way SMP): storage nodes, file system nodes, disks, adapters, ...

High availability
- fault tolerance via logging, replication, and RAID support
- survives node and disk failures

Uniform access via shared disks - single-image file system

High capacity - multiple TB per file system, 100s of GB per file

Standards compliant (X/Open 4.0 "POSIX") with minor exceptions


GPFS vs. local and distributed file systems on the SP2

Native AIX File System (JFS)
- No file sharing - an application can only access files on its own node
- Applications must do their own data partitioning

DCE Distributed File System (a follow-on to AFS)
- Application nodes (DCE clients) share files on a server node
- The switch is used as a fast LAN
- Coarse-grained (file- or segment-level) parallelism
- The server node is a performance and capacity bottleneck

GPFS Parallel File System
- GPFS file systems are striped across multiple disks on multiple storage nodes
- Independent GPFS instances run on each application node
- GPFS instances use storage nodes as "block servers" - all instances can access all disks


Tokyo Video on Demand Trial

Video on Demand for a new "borough" of Tokyo
- Applications: movies, news, karaoke, education, ...
- Video distribution via hybrid fiber/coax
- Trial "live" since June '96
- Currently 500 subscribers
- 6 Mbit/sec MPEG video streams
- 100 simultaneous viewers (75 MB/sec)
- 200 hours of video on line (700 GB)
- 12-node SP-2 (7 distribution, 5 storage)


Engineering Design (GPFS uses)

Major aircraft manufacturer
- Uses CATIA for large designs, Elfini for structural modeling and analysis
- SP used for modeling/analysis
- Uses GPFS to store CATIA designs and structural modeling data
- GPFS allows all nodes to share designs and models


Shared Disks - Virtual Shared Disk architecture (General architecture)

File systems consist of one or more shared disks
- An individual disk can contain data, metadata, or both
- Disks are assigned to failure groups
- Data and metadata are striped to balance load and maximize parallelism

Recoverable Virtual Shared Disk (VSD) for accessing disk storage
- Disks are physically attached to SP nodes
- VSD allows clients to access disks over the SP switch
- The VSD client looks like a disk device driver on the client node
- The VSD server executes I/O requests on the storage node
- VSD supports JBOD or RAID volumes, fencing, and multi-pathing (where the physical hardware permits)

GPFS assumes only a conventional block I/O interface; a minimal sketch of such an interface follows below.
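The following is a minimal, hypothetical C sketch of what "a conventional block I/O interface" could look like. The names (bdev_read, bdev_write, bdev_fence) and fields are assumptions for illustration, not the actual VSD API; the point is that the file system layer needs little more than sector reads, sector writes, and a fencing primitive.

    /* Hypothetical block-device interface, for illustration only. */
    #include <stdint.h>

    typedef struct block_dev {
        uint32_t disk_id;        /* identifies one shared disk in the file system */
        uint32_t sector_size;    /* bytes per sector, e.g. 512                     */
        uint64_t num_sectors;    /* capacity of the disk                           */
    } block_dev_t;

    /* Read/write 'count' sectors starting at 'sector'; return 0 on success. */
    int bdev_read (block_dev_t *d, uint64_t sector, uint32_t count, void *buf);
    int bdev_write(block_dev_t *d, uint64_t sector, uint32_t count, const void *buf);

    /* Fence a failed node: refuse further I/O from it until recovery completes
     * (used by the manager node during node failure recovery). */
    int bdev_fence(block_dev_t *d, uint32_t failed_node_id);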


GPFS Architecture Overview (General architecture)

Implications of the shared-disk model
- All data and metadata reside on globally accessible disks (VSD)
- All access to permanent data goes through the disk I/O interface
- Distributed protocols (e.g., distributed locking) coordinate disk access from multiple nodes
- Fine-grained locking allows parallel access by multiple clients
- Logging and shadowing restore consistency after node failures

Implications of large scale
- Supports up to 4096 disks of up to 1 TB each (4 petabytes); the largest system in production is 75 TB
- Failure detection and recovery protocols handle node failures
- Replication and/or RAID protect against disk and storage-node failures
- On-line dynamic reconfiguration (add, delete, or replace disks and nodes; rebalance the file system)


GPFS Architecture - Node Roles (General architecture)

Three types of nodes: file system, storage, and manager; each node can perform any of these functions.

File system nodes
- run user programs, read/write data to/from storage nodes
- implement the virtual file system interface
- cooperate with manager nodes to perform metadata operations

Manager nodes (one per file system)
- global lock manager
- recovery manager
- global allocation manager
- quota manager
- file metadata manager
- admin services fail over

Storage nodes
- implement the block I/O interface
- shared access from file system and manager nodes
- interact with manager nodes for recovery (e.g., fencing)
- file data and metadata are striped across multiple disks on multiple storage nodes


GPFS Software Structure (General architecture)


Disk Data Structures: Files (General architecture)

- Large block size allows efficient use of disk bandwidth
- Fragments reduce space overhead for small files
- No designated "mirror", no fixed placement function:
  - flexible replication (e.g., replicate only metadata, or only important files)
  - dynamic reconfiguration: data can migrate block by block
- Multi-level indirect blocks
  - each disk address is a list of pointers to replicas
  - each pointer is a disk id + sector number

An illustrative layout of these structures appears below.
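Below is an illustrative C layout for the structures just described (replicated disk addresses and multi-level indirect blocks). Field names, the replication factor, and the direct-slot count are assumptions for the sketch, not the real GPFS on-disk format.

    #include <stdint.h>

    #define MAX_REPLICAS 2              /* assumed replication factor             */

    /* One pointer: which disk, and where on it. */
    typedef struct disk_ptr {
        uint32_t disk_id;               /* index into the file system's disk list */
        uint64_t sector;                /* starting sector of the block           */
    } disk_ptr_t;

    /* One logical disk address: a list of pointers to the block's replicas.
     * With no fixed placement function, each replica can live anywhere. */
    typedef struct disk_addr {
        disk_ptr_t replica[MAX_REPLICAS];
        uint8_t    nreplicas;           /* 1 if unreplicated, 2 if mirrored       */
    } disk_addr_t;

    /* Inode with multi-level indirect blocks: small files fit in the direct
     * slots, larger files go through indirect blocks whose entries are
     * themselves disk_addr_t values. */
    #define NDIRECT 10                  /* assumed count, for illustration        */

    typedef struct inode {
        uint64_t    size;
        uint32_t    block_size;         /* e.g. 256 KB                            */
        disk_addr_t direct[NDIRECT];
        disk_addr_t single_indirect;    /* block of disk_addr_t entries           */
        disk_addr_t double_indirect;    /* block of blocks of disk_addr_t entries */
    } inode_t;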


Large File Block Size (Performance)

- Conventional file systems store data in small blocks to pack data more densely
- GPFS uses large blocks (256 KB default) to optimize disk transfer speed (a rough calculation below shows why)

[Chart: Throughput (MB/sec, 0-7) vs. I/O Transfer Size (0-1024 Kbytes).]
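A back-of-the-envelope model shows why large blocks raise throughput: each I/O pays a fixed positioning cost before transferring at the media rate. The 10 ms overhead and 10 MB/s media rate below are assumed round numbers for disks of that era, not measured GPFS figures.

    #include <stdio.h>

    int main(void) {
        const double overhead_s  = 0.010;       /* assumed seek + rotate: 10 ms  */
        const double media_MBps  = 10.0;        /* assumed sustained media rate  */
        const double blocks_KB[] = { 4, 64, 256, 1024 };

        for (int i = 0; i < 4; i++) {
            double mb      = blocks_KB[i] / 1024.0;
            double total_s = overhead_s + mb / media_MBps;
            printf("%6.0f KB block -> %5.2f MB/s effective\n",
                   blocks_KB[i], mb / total_s);
        }
        return 0;   /* e.g. 4 KB -> ~0.4 MB/s, 256 KB -> ~7 MB/s */
    }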


Parallelism and consistency

- Distributed locking - acquire the appropriate lock for every operation; used for updates to user data
- Centralized management - conflicting operations are forwarded to a designated node; used for file metadata
- Distributed locking + centralized hints - used for space allocation
- Central coordinator - used for configuration changes

Under contention the cost shows up as I/O slowdown effects (additional I/O activity) rather than as token-server overload.


Parallel File Access From Multiple Nodes (Performance)

- GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a single file with no conflict
- Global locking serializes access to overlapping ranges of a file
- Global locking is based on "tokens", which convey access rights to an object (e.g., a file) or a subset of an object (e.g., a byte range)
- Tokens can be held across file system operations, enabling coherent data caching in clients
- Cached data is discarded or written to disk when a token is revoked
- Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file-size operations

A client-side sketch of the byte-range token idea follows below.
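Here is a client-side sketch of how byte-range tokens could be used, assuming a hypothetical token API (br_token_acquire, a revoke callback) and assumed cache helpers; it illustrates the idea on the slide, not the actual GPFS code.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct br_token {
        uint64_t start, end;     /* byte range this node currently holds         */
        bool     exclusive;      /* write (exclusive) vs. read (shared) mode     */
    } br_token_t;

    /* Assumed helpers standing in for the client cache machinery. */
    extern void flush_dirty_range(const char *file, uint64_t start, uint64_t end);
    extern void invalidate_range (const char *file, uint64_t start, uint64_t end);
    extern bool holds_covering_token(const char *file, uint64_t start, uint64_t end,
                                     bool exclusive);
    extern int  write_to_cache(const char *file, uint64_t off, const void *buf,
                               uint64_t len);

    /* Ask the token manager for rights on [required_start, required_end);
     * it may grant a larger "desired" range so future I/O needs no message. */
    extern int br_token_acquire(const char *file, uint64_t required_start,
                                uint64_t required_end, bool exclusive,
                                br_token_t *out);

    /* Called when another node needs a conflicting range: flush or discard
     * cached data covered by the token, then give the token back. */
    void br_token_revoke_cb(const char *file, const br_token_t *tok)
    {
        if (tok->exclusive)
            flush_dirty_range(file, tok->start, tok->end);  /* write-back       */
        else
            invalidate_range(file, tok->start, tok->end);   /* drop cached data */
        /* the token is released to the manager after this callback returns */
    }

    /* A write only needs a token message if the node does not already hold a
     * covering exclusive token -- so non-overlapping writers never conflict. */
    int cached_write(const char *file, uint64_t off, const void *buf, uint64_t len)
    {
        br_token_t tok;
        if (!holds_covering_token(file, off, off + len, /*exclusive=*/true)) {
            int rc = br_token_acquire(file, off, off + len, true, &tok);
            if (rc != 0)
                return rc;
        }
        return write_to_cache(file, off, buf, len);         /* write-behind later */
    }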


Deep Prefetch for High Throughput (Performance)

- GPFS stripes successive blocks across successive disks
- Disk I/O for sequential reads and writes is done in parallel
- GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal degree of parallelism
- Prefetch algorithms now recognize strided and reverse-sequential access
- Accepts hints
- Write-behind policy

Example from the slide's figure: the application reads at 15 MB/sec while each disk reads at 5 MB/sec, so three I/Os are executed in parallel.

A sketch of the block-to-disk striping follows below.
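A small sketch of round-robin striping, mapping a logical block number to a (disk, sector) pair; the modulo placement, disk count, and sector size are assumptions for illustration.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_DISKS       8
    #define BLOCK_SIZE      (256 * 1024)               /* 256 KB default          */
    #define SECTOR_SIZE     512
    #define SECTORS_PER_BLK (BLOCK_SIZE / SECTOR_SIZE)

    typedef struct location { uint32_t disk; uint64_t sector; } location_t;

    static location_t block_to_disk(uint64_t logical_block)
    {
        location_t loc;
        loc.disk   = (uint32_t)(logical_block % NUM_DISKS);         /* next disk */
        loc.sector = (logical_block / NUM_DISKS) * SECTORS_PER_BLK; /* next slot */
        return loc;
    }

    int main(void)
    {
        /* A sequential read of blocks 0..7 touches all eight disks once,
         * so the eight I/Os can be issued in parallel (prefetch depth 8). */
        for (uint64_t b = 0; b < 8; b++) {
            location_t loc = block_to_disk(b);
            printf("block %2llu -> disk %u, sector %llu\n",
                   (unsigned long long)b, loc.disk,
                   (unsigned long long)loc.sector);
        }
        return 0;
    }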


GPFS Throughput Scaling for Non-cached Files (Scalability)

- Hardware: Power2 wide nodes, SSA disks
- Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes
- Result: throughput increases nearly linearly with the number of storage nodes
- Bottlenecks:
  - the Micro Channel bus limits node throughput to 50 MB/s
  - system throughput is limited by the available storage nodes


Disk Data Structures: Allocation map (Scalability)

Segmented block allocation map:
- each segment contains bits representing blocks on all disks
- each segment is a separately lockable unit
- minimizes contention for the allocation map when writing files on multiple nodes
- the allocation manager service provides hints on which segments to try
- the inode allocation map is handled similarly

A sketch of the segmented map follows below.
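The segmented map can be sketched as follows; the segment and bitmap sizes, the pthread mutex standing in for a distributed lock, and the alloc_manager_hint call are assumptions for illustration only.

    #include <stdint.h>
    #include <pthread.h>

    #define NUM_DISKS        8
    #define BLOCKS_PER_SEG   1024      /* blocks described per disk per segment   */
    #define NUM_SEGMENTS     512

    /* One segment covers a slice of every disk, so a node working in "its"
     * segment can stripe a file across all disks without touching the
     * segments other nodes are using. */
    typedef struct alloc_segment {
        pthread_mutex_t lock;                           /* separately lockable    */
        uint8_t bitmap[NUM_DISKS][BLOCKS_PER_SEG / 8];  /* 1 bit per block        */
    } alloc_segment_t;

    static alloc_segment_t map[NUM_SEGMENTS];

    void alloc_map_init(void)                  /* call once before allocation */
    {
        for (uint32_t i = 0; i < NUM_SEGMENTS; i++)
            pthread_mutex_init(&map[i].lock, NULL);
    }

    /* Hypothetical hint from the allocation manager: a segment believed to
     * have free space and not currently in use by another node. */
    extern uint32_t alloc_manager_hint(void);

    /* Allocate one block on 'disk'; returns the block number, or -1 if the
     * hinted segment is full (the caller would then ask for another hint). */
    int64_t alloc_block(uint32_t disk)
    {
        uint32_t seg = alloc_manager_hint() % NUM_SEGMENTS;
        alloc_segment_t *s = &map[seg];

        pthread_mutex_lock(&s->lock);
        for (uint32_t b = 0; b < BLOCKS_PER_SEG; b++) {
            if (!(s->bitmap[disk][b / 8] & (1u << (b % 8)))) {
                s->bitmap[disk][b / 8] |= (1u << (b % 8));   /* mark allocated */
                pthread_mutex_unlock(&s->lock);
                return (int64_t)seg * BLOCKS_PER_SEG + b;    /* block no. on disk */
            }
        }
        pthread_mutex_unlock(&s->lock);
        return -1;
    }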


High Availability - Logging and Recovery (High Availability)

Problem: detect and fix file system inconsistencies after a failure of one or more nodes
- all updates that could leave inconsistencies if left uncompleted are logged
- write-ahead logging policy: the log record is forced to disk before the dirty metadata is written
- redo log: replaying all log records at recovery time restores file system consistency

Logged updates:
- I/O to replicated data
- directory operations (create, delete, move, ...)
- allocation map changes

Other techniques:
- ordered writes
- shadowing

A sketch of the write-ahead ordering follows below.
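The write-ahead rule itself is simple to sketch with POSIX file I/O. The record format and field names below are assumptions; only the ordering (force the log record, then write the metadata) reflects the slide.

    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    typedef struct log_record {
        uint64_t lsn;              /* log sequence number                        */
        uint32_t type;             /* e.g. directory insert, allocation change   */
        uint64_t disk_id, sector;  /* which metadata block the update touches    */
        uint8_t  after_image[512]; /* redo information: the new contents         */
    } log_record_t;

    int logged_metadata_update(int log_fd, int disk_fd,
                               const log_record_t *rec, off_t meta_offset)
    {
        /* 1. Append the redo record and force it to disk (write-ahead rule). */
        if (write(log_fd, rec, sizeof *rec) != (ssize_t)sizeof *rec) return -1;
        if (fsync(log_fd) != 0)                                      return -1;

        /* 2. Only now may the dirty metadata block itself be written.  If the
         *    node dies between steps 1 and 2, replaying the redo log during
         *    recovery reapplies the after-image and restores consistency.    */
        if (pwrite(disk_fd, rec->after_image, sizeof rec->after_image,
                   meta_offset) != (ssize_t)sizeof rec->after_image)
            return -1;
        return 0;
    }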


Node Failure Recovery (High Availability)

Application node failure:
- the force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost
- all potential inconsistencies are protected by a token and are logged
- the file system manager runs log recovery on behalf of the failed node
- actions taken: restore metadata that was being updated by the failed node to a consistent state, and release resources held by the failed node
- after successful log recovery, tokens held by the failed node are released

File system manager failure:
- a new node is appointed to take over
- the new file system manager restores volatile state by querying the other nodes
- the new file system manager may have to undo or finish a partially completed configuration change (e.g., add/delete disk)

Storage node failure:
- dual-attached disk: use the alternate path (VSD)
- single-attached disk: treat as a disk failure


Handling Disk Failures

When a disk failure is detected:
- the node that detects the failure informs the file system manager
- the file system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)

While a disk is down:
- read one / write all available copies
- a "missing update" bit is set in the inode of modified files

When/if the disk recovers:
- the file system manager searches the inode file for missing-update bits
- all data and metadata of files with missing updates are copied back to the recovering disk (one file at a time, under the normal locking protocol)
- until missing-update recovery is complete, data on the recovering disk is treated as write-only

Unrecoverable disk failure:
- the failed disk is deleted from the configuration or replaced by a new one
- new replicas are created on the replacement or on other disks

A sketch of the read-one / write-all-available rule follows below.
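A sketch of the read-one / write-all-available rule, reusing the replica-pointer idea from the file data structures slide; the helper functions and the missing-update hook are assumed names, not GPFS internals.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_REPLICAS 2

    typedef struct replica { uint32_t disk_id; uint64_t sector; } replica_t;

    extern bool disk_is_up(uint32_t disk_id);
    extern int  disk_read (uint32_t disk_id, uint64_t sector, void *buf, size_t len);
    extern int  disk_write(uint32_t disk_id, uint64_t sector, const void *buf, size_t len);
    extern void inode_set_missing_update_bit(uint64_t inode_no);

    /* Read: any single replica on an "up" disk will do. */
    int replicated_read(const replica_t r[MAX_REPLICAS], void *buf, size_t len)
    {
        for (int i = 0; i < MAX_REPLICAS; i++)
            if (disk_is_up(r[i].disk_id) &&
                disk_read(r[i].disk_id, r[i].sector, buf, len) == 0)
                return 0;
        return -1;                       /* no replica reachable */
    }

    /* Write: update every replica whose disk is up; if any replica was
     * skipped, remember that this file needs repair when the disk returns. */
    int replicated_write(uint64_t inode_no, const replica_t r[MAX_REPLICAS],
                         const void *buf, size_t len)
    {
        bool skipped = false;
        for (int i = 0; i < MAX_REPLICAS; i++) {
            if (!disk_is_up(r[i].disk_id)) { skipped = true; continue; }
            if (disk_write(r[i].disk_id, r[i].sector, buf, len) != 0)
                return -1;
        }
        if (skipped)
            inode_set_missing_update_bit(inode_no);   /* drives later recovery */
        return 0;
    }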


Cache Management

The total cache is split into a general pool (clock list; merge and re-map operations) and several block-size pools, each with its own clock list. Per-pool statistics track sequential vs. random accesses and optimal vs. total usage.

- Balance pools dynamically according to usage patterns
- Avoid fragmentation, both internal and external
- Unified steal policy
- Periodic re-balancing

A sketch of clock (second-chance) replacement within one pool follows below.
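A sketch of clock (second-chance) replacement within one block-size pool; the structure names and the write-back hook are assumptions, and the real buffer manager would also coordinate with tokens and write-behind.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct cache_buf {
        uint64_t block_id;
        bool     referenced;     /* set on every hit, cleared by the clock hand */
        bool     dirty;          /* must be written back before reuse           */
        uint8_t *data;
    } cache_buf_t;

    typedef struct bs_pool {      /* one pool per block size                    */
        cache_buf_t *bufs;
        size_t       nbufs;
        size_t       hand;        /* clock hand position                        */
    } bs_pool_t;

    extern void write_back(cache_buf_t *b);   /* assumed write-behind hook */

    /* Find a victim buffer: sweep the clock, giving each referenced buffer a
     * second chance before it is stolen. */
    cache_buf_t *pool_steal(bs_pool_t *p)
    {
        for (;;) {
            cache_buf_t *b = &p->bufs[p->hand];
            p->hand = (p->hand + 1) % p->nbufs;

            if (b->referenced) {           /* recently used: spare it this pass */
                b->referenced = false;
                continue;
            }
            if (b->dirty)                  /* flush before reuse                */
                write_back(b);
            return b;                      /* victim: caller re-maps it         */
        }
    }

    /* On a cache hit the buffer just gets its reference bit set. */
    void pool_hit(cache_buf_t *b) { b->referenced = true; }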


Epilogue

- Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems
- IP rich: ~20 filed patents
- State of the art: TeraSort
  - world record of 17 minutes
  - on a 488-node SP: 432 file system nodes and 56 storage nodes (604e, 332 MHz)
  - 6 TB of total disk space

References
- GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html
- FAST 2002: http://www.usenix.org/events/fast/schmuck.html
- TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html
- Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html