Upload
vandung
View
226
Download
0
Embed Size (px)
Citation preview
Qing Zheng Lin Xiao, Kai Ren, Garth Gibson
File System Architecture
Low-level Storage Infrastructure
Metadata Service APP APP APP APP
APP APP APP APP
I/O operations
metadata operations
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 2
Parallel data path with decoupled metadata path
[flat namespace]
Metadata = 1 + 2 + 3
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 3
Metadata Representation
File System App Library
Namespace Indexing
Small File Storage
Large File Block Indexing
2 1 3
Decoupled != Scalable Single metadata server HDFS, Lustre 1.x
Statically partitioned metadata servers PVFS, Federated HDFS, NFS v4.1
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 4
Many existing metadata service don’t scale
Our Goal is To Have Really
Scalable Metadata
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 5
Our Goal is To Have Really
Scalable Metadata
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 6
Outline 1. Pathname lookup
important limitation on scalability
2. Client-side caching represented by IndexFS
3. Replicated state represented by ShardFS
4. Experimental results
Path Resolution Hierarchical permission checking
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 7
In order to resolve /a/b/c , need to test /a , a/b , b/c , … /a/b/c/… /a a/b b/c …
1) Permissions to lookup names under an intermediate directory 2) The existence of the name 3) The name represents a directory
A set of recursive tests starting from the root
2 Categories of Ops
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 8
Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata
mkdir rename on dirs
rmdir
chmod on dirs
chown on dirs
mknod/create unlink
rename on files chmod/chown on files
utime/utimes
“Dir lookup state mutation ops” “Non-conflicting ops”
Naive Implementation 1 lookup RPC to server for each intermediate dir
Parallel Data Lab - http://www.pdl.cmu.edu/
Client
/a/b/c/… Server 0
/a
Client Side Server Side
Server 1
a/b Server 2
b/c
Client
/a/d/… Other Clients
1. Repeated RPCs
2. Hot spot
servers holding
names at the top
of the tree
BOTTLENECKS
PDL Group Meeting 9
Outline 1. Pathname lookup
important limitation on scalability
2. Client-side caching represented by IndexFS
3. Replicated state represented by ShardFS
4. Experimental results
PDL Group Meeting 10
Design Choice #1 Outline 1. Pathname lookup
important limitation on scalability
2. Client-side caching represented by IndexFS
3. Replicated state represented by ShardFS
4. Experimental results
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 11
Lease and cache dir lookup states at clients Block mutation ops until all leases have expired
Expiration-time Cache-Entry
client-side cache table
Max-expiration-time Cache-Id
server-side cache table
Fewer repeated RPCs & simple server states
IndexFS Design Distributes namespace on a per-dir partition basic Path resolution conducted by clients with an consistent lease-based lookup cache
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 12
c
/
a
b d
e 1 2 3
/a
a/c
c/2
1
ROOT
3 2
a
d
b e
c
client cache
Outline 1. Pathname lookup
important limitation on scalability
2. Client-side caching represented by IndexFS
3. Replicated state represented by ShardFS
4. Experimental results
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 13
Design Choice #2 Outline 1. Pathname lookup
important limitation on scalability
2. Client-side caching represented by IndexFS
3. Replicated state represented by ShardFS
4. Experimental results
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 14
Replicates dir lookup states to all servers & broadcasts mutation ops to all servers
mkdir
Server 1 Server 2 Server 3 Server n
Principally a better decision if #client >> #server
ShardFS Design Distributes namespace on a per-file basic (sharding) All metadata servers can accept new files and perform path resolution
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 15
c
/
a
b d
e 1 2 3
ROOT
2
ROOT
ROOT ROOT
1
3
client /a/b/1
/a/c/3
a
b
d
c
e
RPC Amplification
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 16
path resolution mknod
unlink/getattr mkdir*
rmdir*/readdir chmod/chown on file
chmod*/chown* on dir utime on file/dir
0 1 1
#metadata_servers #metadata_servers
1 #metadata_server
1
0 ~ #path_depth 1 1
1 + 1 1 + #partitions
1 1 1
#path_lookups +
IndexFS ShardFS * dir lookup state mutation op
Micro Benchmark YCSB++ [SoCC11 ]
Balanced tree 1280 files per leaf dir
Zipfian tree files distributed unevenly, creating static hot spots
Uniform/zipfian stat’s random getattrs on files, creating dynamic hot spots
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 17
5 levels
fan out=10
Root dir
Leaf dirs
111K dirs
128M files
Th
rou
gh
pu
t
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 18
10,829
4,728
8,819 8,507
0
4,000
8,000
12,000
16,000
20,000
Balanced Tree Zipfian Tree
IndexFS ShardFS
3,071 3,635
14,654 15,502
Balanced Tree Zipfian Tree
IndexFS ShardFS
2,022 2,105
6,529 6,356
Balanced Tree Zipfian Tree
IndexFS ShardFS
Creation Phase Uniform Stat’s Zipfian Stat’s op/s
metadata footprint
dir splitting load variance
path resolution
hot spots on files
Macro Benchmark YCSB++ [SoCC11 ]
Strong scaling fixed namespace and fixed fs ops
Weak scaling fixed dir structure and fixed dir lookup state mutation ops
with scaling number of files and scaling number of other fs ops
LinkedIn trace replay 1-day HDFS trace with 1.9M dirs & 11.4M files
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 19
s s s s s s s s s s s s
strong scaling 1 2 3 1 2 3
chmod /a open /a/1
chmod /a open /a/1 a a
s s s s s s s s s s s s
weak scaling 1 2 3
chmod /a open /a/1
chmod /a open /a/1 open /a/α
1 β γ α 2 2
a a
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 20
80
160
320
640
1280
8 16 32 64 128 8 16 32 64 128
IndexFS-1m IndexFS-250ms
IndexFS-50ms ShardFS
IndexFS-1m IndexFS-250ms
IndexFS-50ms ShardFS
number of metadata servers
Th
rou
gh
pu
t Kop/s Strong Scaling Weak Scaling
References IndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion. Kai Ren, Qing Zheng, Swapnil Patil and Garth Gibson. SC 2014
TableFS: Enhancing metadata efficiency in local file systems. Kai Ren and Garth Gibson. USENIX ATC 2013
Scale and Concurrency in GIGA+: File System Directories with Millions of Files. Swapnil Patil and Garth Gibson. FAST 2009
YCSB++: Benchmarking and Performance Debugging Advanced Features in Scalable Table Stores. Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao, Julio Lopez, Garth Gibson, Adam Fuchs, Billie Rinaldi. SoCC 2011
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 21
BACKUP SLIDES
Target Namespace 90% dirs are small (less than 128 entries)
Large dirs are really huge
90% dirs are of depth 16 or more
Median file size smaller then 64KB for many fs’es
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 23
Distribution of FS Ops
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 24
0%
20%
40%
60%
80%
100%
LinkedIn OpenCloud Yahoo! PROD Yahoo! R&D
Mknod Open Chmod Rename Mkdir Readdir Remove
ShardFS/IndexFS Overview
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 25
Application
Metadata Server (ShardFS/IndexFS Server)
DFS Metadata Server
DFS Data Server
Metadata Client (ShardFS/IndexFS Client)
metadata storage
metadata operations DFS internal operations
data storage
block operations
Underlying Distributed File System ShardFS or IndexFS
namespace indexing
small file storage
large file block indexing
Metadata Representation Log-structured and indexed data structure
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 26
[(k,v), (k,v), ..., (k,v)]
mkdir
In-mem buffer
SSTable1 SSTable2 SSTable3
Key-Value Store
ShardFS/IndexFS Servers
TableFS [ATC13] IndexFS [SC14]
Namespace Metadata = dir index + object attributes + file data for small files
PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 27
Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata
Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata
Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata
Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata
Key Value
Sorte
d