Scalable metadata management at very large scale filesystems: A Survey

Presented by Viet-Trung TRAN, KerData Team

Outline

- Overview of metadata management
- Motivation: distributed metadata management
- Metadata distribution strategies
  - Static sub-tree partitioning
  - Hashing
  - Lazy Hybrid
  - Dynamic sub-tree partitioning
  - Probabilistic lookup
- Metadata partitioning granularity
- Hierarchical file systems are dead
- Gfarm/BlobSeer: a versioning distributed file system

Overview of metadata management

- Hierarchical file systems (the two mappings are sketched below)
  - Name space/directory service: maps a human-readable name to a file identifier (/a/b/c -> 232424)
  - File location service: maps a file identifier to distributed file parts (232424 -> {block/object addresses})
  - Access control service
- Flat file systems (Amazon S3, some obsolete ones: CFS, PAST)
  - No name space/directory service
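
Note (illustrative, not from the slides): a minimal sketch of the two metadata mappings above using plain Python dictionaries; the paths, identifier, and object addresses are made up.

    # Minimal sketch of the two metadata services (illustration only).

    namespace = {"/a/b/c": 232424}                       # directory service: path -> file id
    locations = {232424: ["osd1:obj17", "osd4:obj92"]}   # location service: file id -> object addresses

    def resolve(path):
        """Resolve a path to the addresses of its data objects."""
        file_id = namespace[path]     # name space / directory service
        return locations[file_id]     # file location service

    print(resolve("/a/b/c"))          # ['osd1:obj17', 'osd4:obj92']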

Motivation

- 50% to 80% of file system accesses are to metadata [Ousterhout et al., 1985]
- Current trend: distributed object-based file systems
  - Object-based storage devices manage data
  - Metadata servers manage metadata
  - Provide the ability to handle scalable I/O efficiently
- But the limited performance of the metadata server (MDS) still becomes critical to overall system performance
- At very large scale
  - The number of files/directories can reach trillions
  - Number of concurrent users: ???
- => Scalable metadata management in a distributed manner is crucial

Metadata distribution strategies

- Problem
  - Multiple metadata servers, each handling a part of the whole namespace
  - Input: a file pathname
  - Output: the metadata server that owns the corresponding file metadata
- Constraints
  - Load balancing
  - Lookup time
  - Migration cost upon reconfiguration
  - Directory operations (e.g., ls)
  - Scalability
  - Locking
- Solutions ???

Static sub-tree partitioning

- NFS, Sprite, AFS, Coda
- The name space is partitioned at system configuration time
- A system administrator decides how the name space should be distributed
  - Manually assigns sub-trees of the hierarchy to individual servers
- Pros
  - Simple for clients to identify the server responsible for a piece of metadata
  - No inter-server communication; servers are independent of each other
  - Concurrent processing among different sub-trees
- Cons
  - No load balancing
  - Very coarse granularity

Hashing

- Lustre, Vesta, zFS
- Hash the file pathname (or a global file identifier) to the location of the metadata, i.e. the corresponding metadata server (see the sketch below)
- Pros
  - Load balancing
  - O(1) lookup time
- Cons
  - Hashing eliminates all hierarchical locality
  - Large migration cost: rehashing is needed when the configuration/number of servers changes
  - Very slow for some directory operations: rename, create link
  - Hard to satisfy POSIX directory access semantics
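
Note (illustrative, not from the slides): a minimal sketch of pathname hashing and of why changing the server count forces migration; server counts and paths are made up.

    import hashlib

    def mds_for(path: str, num_servers: int) -> int:
        """Map a full pathname to a metadata server by hashing."""
        digest = hashlib.md5(path.encode()).hexdigest()
        return int(digest, 16) % num_servers      # O(1) lookup, no hierarchical locality

    paths = ["/home/a/x.txt", "/home/a/y.txt", "/var/log/z"]
    before = {p: mds_for(p, 4) for p in paths}    # 4 metadata servers
    after = {p: mds_for(p, 5) for p in paths}     # add one server
    moved = [p for p in paths if before[p] != after[p]]
    print(moved)   # most entries change server -> large migration cost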

Lazy Hybrid

- Brandt et al., 2003
- Seeks to capitalize on the benefits of a hashed distribution while avoiding the path traversal normally required for permission checking
- Still relies on hashing the full pathname to distribute metadata
- A small modification: the hash value is used as an index into a metadata lookup table (MLT) rather than as the metadata location itself, to facilitate the addition and removal of servers (see the sketch below)
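
Note (illustrative, not from the slides): a minimal sketch of this extra level of indirection, assuming a fixed number of hash buckets mapped to servers by an MLT; the bucket count and server ids are made up.

    import hashlib

    NUM_BUCKETS = 16                                   # fixed, independent of the server count
    mlt = {b: b % 4 for b in range(NUM_BUCKETS)}       # MLT: bucket -> server id (4 servers)

    def bucket_of(path: str) -> int:
        return int(hashlib.md5(path.encode()).hexdigest(), 16) % NUM_BUCKETS

    def mds_for(path: str) -> int:
        return mlt[bucket_of(path)]                    # the hash indexes the MLT, not a server directly

    # Adding a 5th server only reassigns a few MLT entries;
    # pathnames are never rehashed.
    for b in (3, 7, 11):
        mlt[b] = 4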

Lazy Hybrid

- Avoids directory traversal for permission checking
- => Uses a dual-entry access control list (ACL) structure for managing permissions. Each file or directory carries two ACLs (see the sketch below):
  - File permissions
  - Path permissions: constructed recursively from the ancestors; only updated when an ancestor directory's access permissions change
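
Note (illustrative, not from the slides): a minimal sketch of a dual-entry ACL check, assuming each metadata entry stores a file ACL plus a precomputed path ACL; the user names and modes are made up.

    from dataclasses import dataclass, field

    @dataclass
    class Metadata:
        file_acl: set = field(default_factory=set)   # e.g. {("alice", "r")}
        path_acl: set = field(default_factory=set)   # precomputed permissions along the path

    def can_access(meta: Metadata, user: str, mode: str) -> bool:
        # No directory traversal: both checks use locally stored ACLs.
        return (user, mode) in meta.file_acl and (user, "x") in meta.path_acl

    m = Metadata(file_acl={("alice", "r")}, path_acl={("alice", "x")})
    print(can_access(m, "alice", "r"))   # True, decided on a single MDS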

Lazy Hybrid ‒ Lazy policies

- 4 expensive operations:
  - Changing the permissions of a directory
  - Changing the name of a directory
  - Removing a directory
  - Changing the MLT
- => message exchanges, metadata migration
- Lazy policies: execute the operation first, update the affected metadata later, upon first access
  - Invalidation: through inter-server communication
  - Lazy metadata update and relocation (recursively updated up to the root)

Lazy Hybrid

- Pros
  - Avoids directory traversal in most cases
  - Pros of pure hashing (load balancing, lookup time, recovery)
  - Good scalability
- Cons
  - No locality benefits (a small modification affects multiple metadata servers)
  - Hot spots on individually popular files (could be improved by dynamically replicating the associated metadata; is that possible on a DHT?)

Dynamic sub-tree partitioning - Ceph

- Weil et al., 2004, 2006
- Ceph dynamically maps sub-trees of the directory hierarchy to metadata servers based on the current workload
- Individual directories are hashed across multiple nodes only when they become hot spots
- Key design: partitioning by pathname

Dynamic sub-tree partitioning - Ceph

- No distributed locking (each piece of metadata has an authoritative MDS)
  - The authoritative MDS serializes accesses to the metadata
- Collaborative caching
- Pros
  - Inodes embedded in dentries to speed up directory operations
  - Locality benefits
  - Load balancing (the number of replicas is dynamically adjusted)
- Cons
  - Needs accurate load measurement
  - Migration cost upon addition/removal of servers
  - The paper is not easy to follow

Dynamic sub-tree partitioning - Farsite

- Douceur et al., 2006
- Partitioning by pathname complicates renames across partitions
- Partitioning by file identifier, which is immutable, can be a better approach
- Tree-structured file identifiers, encoded with a variant of Elias gamma coding (see the sketch below)
- Clients use a file map to know which server manages which region of the file-identifier space
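
Note (illustrative, not from the slides): Elias gamma coding itself as a minimal sketch; how Farsite applies it to tree-structured identifiers is not shown here.

    def elias_gamma_encode(n: int) -> str:
        """Elias gamma code of a positive integer: (len-1) zeros, then n in binary."""
        assert n >= 1
        binary = bin(n)[2:]                        # e.g. 9 -> '1001'
        return "0" * (len(binary) - 1) + binary    # 9 -> '0001001'

    def elias_gamma_decode(bits: str) -> int:
        zeros = len(bits) - len(bits.lstrip("0"))  # count leading zeros
        return int(bits[zeros:zeros + zeros + 1], 2)

    print(elias_gamma_encode(9))                       # 0001001
    print(elias_gamma_decode(elias_gamma_encode(9)))   # 9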

Dynamic sub-tree partitioning - Farsite

- Two-phase locking to make the rename operation atomic (see the sketch below)
  - Leader: the destination directory
  - Two followers: the source directory and the file being renamed
  - Each follower validates its part, locks the relevant metadata, and notifies the leader
  - The leader decides whether the update is valid
- Dynamic partitioning policies
  - Optimal policies are not yet developed
  - A file is considered active if its metadata has been accessed within a 5-minute interval
  - The load on a region of the file-identifier space is the count of active files in that region
  - Transfer a few heavily loaded sub-trees rather than many lightly loaded ones
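
Note (illustrative, not from the slides): a toy sketch of the leader/follower validation pattern described above; the classes, locking, and always-successful validation are simplified assumptions, not Farsite's actual protocol.

    import threading

    class Participant:
        """One directory or file involved in a rename, owned by some server."""
        def __init__(self, name):
            self.name = name
            self.lock = threading.Lock()

        def prepare(self) -> bool:
            """Validate the local part of the rename and lock its metadata."""
            self.lock.acquire()
            return True                          # simplified: validation always succeeds

        def finish(self, commit: bool):
            # apply or discard the local change, then release the lock
            self.lock.release()

    def rename(leader: Participant, followers: list):
        votes = [f.prepare() for f in followers]    # phase 1: followers validate + lock
        commit = all(votes) and leader.prepare()    # leader validates last and decides
        for f in followers:
            f.finish(commit)                        # phase 2: apply or abort, unlock
        if all(votes):                              # leader locked only if followers agreed
            leader.finish(commit)
        return commit

    print(rename(Participant("dst_dir"), [Participant("src_dir"), Participant("file")]))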

Dynamic sub-tree partitioning - Farsite

- Pros
  - Fast file creation and renaming
  - Dynamic partitioning
  - Requires fewer multi-server operations
- Cons
  - Requires directory traversal from the root to the desired file, since the partitioning is not name-based (can be reduced by caching)

Probabilistic lookup ‒ Bloom filter based

- Representative work:
  - [1] Y. Zhu, H. Jiang, J. Wang, and F. Xian, "HBA: Distributed Metadata Management for Large Cluster-Based Storage Systems," IEEE Trans. Parallel Distrib. Syst., vol. 19, 2008, pp. 750-763.
  - [2] Y. Hua, Y. Zhu, H. Jiang, D. Feng, and L. Tian, "Scalable and Adaptive Metadata Management in Ultra Large-Scale File Systems," The 28th International Conference on Distributed Computing Systems (ICDCS 2008), 2008, pp. 403-410.
  - [3] Y. Hua, H. Jiang, Y. Zhu, D. Feng, and L. Tian, "SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems," 2009.

Probabilistic lookup ‒ Bloom filter based

- A Bloom filter (BF) is a fast and space-efficient probabilistic data structure used to test whether an element is a member of a set. Elements can be added to the set, but not removed. The more elements are added, the larger the probability of false positives (see the sketch below).
- False positive: the test returns true but the element is not in the set.
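
Note (illustrative, not from the slides): a minimal Bloom filter with k hash positions derived from md5; the sizes are arbitrary.

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=4):
            self.m, self.k = m, k
            self.bits = [False] * m

        def _positions(self, item: str):
            for i in range(self.k):                  # k derived hash functions
                h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.m

        def add(self, item: str):
            for pos in self._positions(item):
                self.bits[pos] = True

        def might_contain(self, item: str) -> bool:
            # True may be a false positive; False is always correct.
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("/home/a/x.txt")
    print(bf.might_contain("/home/a/x.txt"))   # True
    print(bf.might_contain("/home/a/y.txt"))   # usually False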

Pure Bloom filter array approach

- Each MDS maintains an array of BFs
  - One BF of m bits represents all files whose metadata is stored locally
  - The other BFs are replicas of the BFs of the other MDSs
- A client randomly chooses an MDS and asks whether it owns the file's metadata (see the sketch below)
  - The client sends the full pathname
  - The MDS uses k hash functions to compute k array positions
  - The MDS checks its array of BFs and returns either the correct MDS or, if it owns the metadata, the metadata itself
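
Note (illustrative, not from the slides): a minimal sketch of the lookup flow, modeling each "Bloom filter" as a plain Python set (so no false positives) purely to keep it self-contained; a real BF would behave like the previous sketch, and the replica update path is omitted.

    class MDS:
        def __init__(self, mds_id, num_servers):
            self.mds_id = mds_id
            self.local = {}                                      # pathname -> metadata
            self.bf_array = [set() for _ in range(num_servers)]  # bf_array[i] mirrors MDS i

        def insert(self, path, metadata):
            self.local[path] = metadata
            self.bf_array[self.mds_id].add(path)      # replica BFs on other MDSs updated out of band

        def lookup(self, path):
            if path in self.local:
                return ("metadata", self.local[path])            # this MDS owns it
            hits = [i for i, bf in enumerate(self.bf_array) if path in bf]
            return ("try_mds", hits)                  # with real BFs, may contain false positives

    m0 = MDS(0, num_servers=3)
    m0.insert("/home/a/x.txt", {"inode": 7})
    print(m0.lookup("/home/a/x.txt"))   # ('metadata', {'inode': 7})
    print(m0.lookup("/home/a/y.txt"))   # ('try_mds', [])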

Pure Bloom filter array approach

- Pros
  - Load balancing (claimed by the authors; disputed below)
  - Low migration cost upon the addition/removal of an MDS (only a single BF needs to be added or deleted)
- Cons
  - No load balancing: the approach does not care about how the metadata is distributed; placement is essentially random
  - False positives: the accuracy of PBA degrades quickly as the number of files increases
  - When a file or directory is renamed, only the BFs associated with the involved files need to be rebuilt, but this can still take a long time since they must all be rehashed with the k hash functions

Hierarchical Bloom filter array approach

- Assumption: a small portion of the files absorbs most of the I/O activity
- Ensures that least recently used (LRU) BFs with a high bit/file ratio cache the recently visited files and are replicated globally among all MDSs => high hit rate

Hierarchical Bloom filter array approach

- Pros
  - Higher hit rate
- Cons
  - Evaluated only by simulation
  - Hashing the full pathname still implies a directory traversal to check permissions; could be improved by using a dual-entry access control list (ACL)

Metadata partitioning granularity

- Problem
  - Efficiently organize and maintain very large directories, each containing billions of files
  - High metadata performance, with the capability of parallel metadata modifications
- Example
  - A directory with billions of entries can grow to 10-100 GB in size
  - A directory = {dentry | dentry = (name, inode)}
  - Concurrent updates to a single huge directory
  - The synchronization needed to update the metadata greatly restricts parallelization
- How should the inner structure of a scalable distributed directory be designed to facilitate parallelization?

Scalable distributed directories

- No partitioning (single metadata server)
- Sub-tree partitioning
  - Static (NFS, Sprite)
  - Dynamic (Ceph, Farsite)
- File partitioning
  - Hashing to distribute metadata
  - Bloom-filter-based lookup (hashing again)
- Partitioning within a single directory
  - GPFS from IBM, Schmuck et al., 2002
  - GIGA+, Patil et al., 2007
  - A modification of GIGA+, Xing et al., 2009

GIGA+: Scalable directories for shared file systems

- Focuses on
  - Maintaining UNIX file-system semantics
  - High throughput and scalability
  - Incremental growth
  - Minimal bottlenecks and shared-state synchronization

GIGA+: Scalable directories for shared file systems

- A directory is dynamically fragmented into a number of partitions
- Extendible hashing provides incremental growth (see the sketch below)
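
Note (illustrative, not from the slides): a toy sketch of incremental growth by splitting hash partitions, loosely in the spirit of extendible hashing; the split threshold and radix handling are simplified assumptions, not GIGA+'s actual algorithm.

    import hashlib

    SPLIT_THRESHOLD = 2          # entries per partition before splitting (toy value)

    def h(name: str) -> int:
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    partitions = {0: {}}         # partition index -> {file name: inode}
    depth = {0: 0}               # radix (number of hash bits) used by each partition

    def partition_of(name: str) -> int:
        r = max(depth.values())
        i = h(name) % (2 ** r)
        while i not in partitions:        # fall back to the ancestor partition
            r -= 1
            i = h(name) % (2 ** r)
        return i

    def insert(name: str, inode: int):
        i = partition_of(name)
        partitions[i][name] = inode
        if len(partitions[i]) > SPLIT_THRESHOLD:   # incremental growth: split one partition
            d = depth[i] + 1
            sibling = i + 2 ** depth[i]
            partitions[sibling], depth[sibling] = {}, d
            depth[i] = d
            for n in list(partitions[i]):          # redistribute only this partition's entries
                if h(n) % (2 ** d) == sibling:
                    partitions[sibling][n] = partitions[i].pop(n)

    for k in range(8):
        insert(f"file{k}", k)
    print(sorted(partitions))    # more partitions appear as the directory grows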

GIGA+: Scalable directories for shared file systems

- How are directory partitions mapped to MDSs?
  - Clients cache a partition-to-server map (P2SMap)
  - Clients may hold inconsistent copies of the P2SMap
  - GIGA+ keeps the history of the fragmentation process, so stale clients can be redirected

GIGA+: Scalable directories for shared file systems

- Hash(directory full pathname) -> the home server of the directory
- Hash(file name) -> the partition ID within the directory
- The P2SMap is represented as a bitmap (a simplified lookup is sketched below)
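
Note (illustrative, not from the slides): a minimal sketch of resolving a file name against a bitmap of existing partitions, assuming a partition index is valid only if its bit is set and that a missing partition falls back to a coarser radix; this mirrors the toy split logic above, not GIGA+'s exact encoding.

    import hashlib

    def h(name: str) -> int:
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    def partition_id(name: str, bitmap: list) -> int:
        """Resolve a file name to a partition index using a presence bitmap."""
        radix = max(1, (len(bitmap) - 1).bit_length())   # finest radix the bitmap can describe
        for r in range(radix, -1, -1):                   # try from finest to coarsest radix
            i = h(name) % (2 ** r)
            if i < len(bitmap) and bitmap[i]:
                return i
        raise ValueError("partition 0 must always exist")

    def server_of(partition: int, num_servers: int) -> int:
        return partition % num_servers                   # toy round-robin placement

    bitmap = [1, 1, 0, 1]        # partitions 0, 1 and 3 exist; 2 was never created
    p = partition_id("report.txt", bitmap)
    print(p, server_of(p, 4))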

GIGA+: Scalable directories for shared file systems

- Two-level metadata
  - Infrequently updated metadata is managed on a centralized MDS
    - Owner, creation time, ...
  - Only highly dynamic attributes are managed across all servers holding the directory
    - Modification time, access time, ...
- Pros
  - Parallel processing within a directory
- Cons
  - Only efficient for large directories

Summary

- Some small notes for distributed metadata management:
  - Dual-entry access control lists
  - Embedded inodes
  - Two-level metadata
- Can we build a system that is better than, or at least a combination of, GIGA+, dynamic sub-tree partitioning, and LH?
- At least, one idea that comes to mind is to improve the BlobSeer version manager:
  - "Distributed BLOB management based on the GIGA+ approach"
  - Each version manager manages a part of the BLOB namespace
  - This idea must be refined

Hierarchical file systems are dead

- Context
  - The hierarchical file system namespace is over forty years old
  - A typical disk was 300 MB and is now closer to 300 GB
  - Billions of files and directories
- Proposals
  - New APIs
  - Indexing
  - Semantics

Hierarchical file systems are dead

- [1] S. Ames, C. Maltzahn, and E. Miller, "Quasar: A Scalable Naming Language for Very Large File Collections," 2008.
- [2] A. Leung, A. Parker-Wood, and E. Miller, "Copernicus: A Scalable, High-Performance Semantic File System," 2009.
- [3] A. Leung, I. Adams, and E. Miller, "Magellan: A Searchable Metadata Architecture for Large-Scale File Systems," 2009.
- [4] A. Leung, M. Shao, T. Bisson, S. Pasupathy, and E. Miller, "Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems," Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST '09), 2009.
- [5] S. Patil, G.A. Gibson, G.R. Ganger, J. Lopez, M. Polte, W. Tantisiroj, and L. Xiao, "In Search of an API for Scalable File Systems: Under the Table or Above It?," 2009, pp. 1-5.
- [6] M. Seltzer and N. Murphy, "Hierarchical File Systems Are Dead," 2009.
- [7] Y. Hua, H. Jiang, Y. Zhu, D. Feng, and L. Tian, "SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems," 2009.

Gfarm/BlobSeer: A versioning file system

- Motivation: a versioning file system with
  - Versioning access control
  - Time-based access, which enables root to roll back the whole system to a given time
  - Consistency semantics
  - Version granularity: on-close, on-snapshot, on-write
    - Can be specified per file
  - Replication techniques
  - Efficient version management with respect to space and workload

Gfarm/BlobSeer: Key design

- Versioned data access control
  - A versioning flag as an additional access mode in the standard file system interface
    - Controls which users/groups can access versioned data
    - Controls whether a file/directory must be versioned or not
    - Built on ACLs (need to check whether Gfarm supports ACLs)
  - Normal user access
    - Access control based on ACLs
  - Administrative access
    - Only the root administrator is able to roll back the whole system, using dedicated root commands

Gfarm/BlobSeer: Key design

- How to access the system
  - Version-based API: a file handle and a desired version as parameters
  - Time-based API: a file handle and a timestamp as parameters
  - The system should also provide versioned directory operations
- To allow time-based access, clocks must be synchronized between the system components. There are two cases:
  - Time-based access to BLOBs where clock synchronization is an assumption
  - An algorithm for ensuring clock synchronization

Gfarm/BlobSeer: Key design

- Time-based access scheme (figure)

Gfarm/BlobSeer: Key design

- Consistency semantics: there are 3 cases
  - Read-Read: nothing special to consider; caches and replicas are welcome
  - Write-Write: write operations can concurrently generate new BLOB versions; caches are fine, but only one replica may be written concurrently
  - Read-Write: two sub-cases: (1) reading a specific version: caches and replicas are fine; (2) reading with the live up-to-date flag: disable caches on all writers and on the readers holding the live up-to-date flag, and disable all replicas
- Versioning granularity: handled per file, based on extended attributes

Gfarm/BlobSeer: Key design

- Replication technique
  - Rely on BlobSeer replication rather than on Gfarm replication
- Versioning awareness: an efficient decoupled version management scheme, in order to save space and management workload
  - Versioned data is stored on BlobSeer
  - Versioned metadata:
    - Inodes and the directory structure on the metadata server
    - Infrequently changing file attributes on the metadata server
    - File attributes associated with each single version on BlobSeer

Gfarm/BlobSeer: Implementation

- Versioned directory structure: work on the Gfarm metadata server
  - The directory structure is reorganized as a multi-version B-tree
  - [1] B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer, "An Asymptotically Optimal Multiversion B-tree," The VLDB Journal, vol. 5, 1996, pp. 264-275.
  - [2] C. Soules, G. Goodson, J. Strunk, and G. Ganger, "Metadata Efficiency in Versioning File Systems," Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST '03), 2003.
  - [3] T. Haapasalo and I. Jaluta, "Transactions on the Multiversion B+-tree," 2009.

Gfarm/BlobSeer: Implementation

- Multi-version B-tree
  - Given a timestamp, the multi-version B-tree can efficiently return the entries that existed at that time (a simplified lookup is sketched below)
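
Note (illustrative, not from the slides): a drastically simplified stand-in for a multi-version B-tree, keeping per-entry validity intervals in a flat list instead of a real tree; it only shows the kind of time-based query the structure answers.

    import math
    from dataclasses import dataclass

    @dataclass
    class Entry:
        name: str
        inode: int
        start: float                 # timestamp when the entry was created
        end: float = math.inf        # timestamp when it was deleted/overwritten

    class VersionedDirectory:
        """Flat-list stand-in for a multi-version B-tree (illustration only)."""
        def __init__(self):
            self.entries = []

        def insert(self, name, inode, now):
            self.entries.append(Entry(name, inode, now))

        def delete(self, name, now):
            for e in self.entries:
                if e.name == name and e.end == math.inf:
                    e.end = now

        def listing_at(self, t):
            """Entries that existed at time t."""
            return [e.name for e in self.entries if e.start <= t < e.end]

    d = VersionedDirectory()
    d.insert("a.txt", 1, now=10)
    d.insert("b.txt", 2, now=20)
    d.delete("a.txt", now=30)
    print(d.listing_at(25))   # ['a.txt', 'b.txt']
    print(d.listing_at(35))   # ['b.txt']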

Gfarm/BlobSeer: Implementation

- Infrequently updated metadata will be handled with a journal-based approach?
- Map versioned file attributes to versioned object attributes on BlobSeer
  - Modify the BlobSeer version manager to handle object attributes
  - Or distribute them somewhere on the DHT

Gfarm/BlobSeer: Implementation

- Time-based access to a BLOB
  - Assign a creation time, together with the version number, to each newly created BLOB version
  - Given a pair {BLOB ID, timestamp}, the version manager can map it to a pair {BLOB ID, version number} (a simplified mapping is sketched below)
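
Note (illustrative, not from the slides): a minimal sketch of the timestamp-to-version mapping, assuming the version manager keeps, per BLOB, a list of version creation times sorted in increasing order.

    import bisect

    # Per-BLOB creation times, one per version, sorted (version numbers start at 1).
    creation_times = {
        42: [100.0, 250.0, 400.0],   # BLOB 42 has versions 1, 2, 3
    }

    def version_at(blob_id: int, timestamp: float):
        """Map {BLOB ID, timestamp} to {BLOB ID, version number}."""
        times = creation_times[blob_id]
        idx = bisect.bisect_right(times, timestamp)   # versions created at or before timestamp
        if idx == 0:
            return None                               # the BLOB did not exist yet
        return (blob_id, idx)                         # latest version no newer than timestamp

    print(version_at(42, 300.0))   # (42, 2)
    print(version_at(42, 50.0))    # None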

- Replication policy
  - Cluster awareness in BlobSeer, in order to improve access performance to each replica

Conclusion

- What can I work on?
  - Versioning access control
  - Multi-version B-trees
  - Time-based access to BlobSeer
  - Cluster awareness in BlobSeer
  - Clock synchronization
  - Some more recent papers on versioning file systems have been found but not yet read
- Papers that could come out of this work:
  - Security control in a versioning file system
  - Gfarm/BlobSeer: a versioning distributed file system
  - Scalable dynamic distributed version/BLOB management

THANK YOU FOR YOUR ATTENTION!