
Architecture of a Next-Generation Parallel File System



Great Wide Open, Day 1. Boyd Wilson, Omnibond. 4:30 PM, Emerging track.


Page 1: Architecture of a Next-Generation Parallel File System

Architecture of a Next-Generation Parallel File System

Page 2: Architecture of a Next-Generation Parallel File System

Agenda

- Introduction
- What's in the code now
- Futures

Page 3: Architecture of a Next-Generation Parallel File System

An introduction

Page 4: Architecture of a Next-Generation Parallel File System

What is OrangeFS?

- OrangeFS is a next-generation parallel file system
- Based on PVFS
- Distributes file data across multiple file servers, leveraging any block-level file system
- Distributes metadata across 1 to all storage servers
- Supports simultaneous access by multiple clients, including Windows clients using the PVFS protocol directly
- Works with standard kernel releases and does not require custom kernel patches
- Easy to install and maintain

Page 5: Architecture of a Next-Generation Parallel File System

Why a Parallel File System?

HPC, Data Intensive (parallel PVFS protocol):

- Large datasets
- Checkpointing
- Visualization
- Video
- Big Data

Unstructured Data Silos:

- Unify dispersed file systems
- Simplify storage leveling

Interfaces to Match Problems:

- Multidimensional arrays
- Typed data
- Portable formats

Page 6: Architecture of a Next-Generation Parallel File System

Original PVFS Design Goals

- Scalable
  - Configurable file striping
  - Non-contiguous I/O patterns
  - Eliminates bottlenecks in the I/O path
  - Does not need locks for metadata ops
  - Does not need locks for non-conflicting applications
- Usability
  - Very easy to install; small VFS kernel driver
  - Modular design for disk, network, etc.
  - Easy to extend: hundreds of research projects have used it, including dissertations, theses, etc.

Page 7: Architecture of a Next-Generation Parallel File System

OrangeFS Philosophy

- Focus on a broader set of applications
- Customer and community focused (>300-member community and growing)
- Open source
- Commercially viable
- Enables research

Page 8: Architecture of a Next-Generation Parallel File System

Design balance:

- Configurability
- Performance
- Consistency
- Reliability

Page 9: Architecture of a Next-Generation Parallel File System

System Architecture

- OrangeFS servers manage objects
  - Objects map to a specific server
  - Objects store data or metadata
  - The request protocol specifies operations on one or more objects
- OrangeFS object implementation
  - DB for indexing key/value data
  - Local block file system for the data stream of bytes
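As a sketch of the object model above: each object handle maps deterministically to one server, which stores either key/value metadata or a byte stream on its local file system. A minimal Python illustration (the `Server` and `route` names are hypothetical, not OrangeFS APIs):

```python
# Illustrative sketch only: objects map to a specific server; each server
# keeps a key/value store (metadata) and byte streams (data).

class Server:
    def __init__(self, name):
        self.name = name
        self.kv = {}       # stands in for the key/value DB (metadata)
        self.streams = {}  # stands in for byte streams on the local FS (data)

def route(handle: int, servers: list) -> Server:
    # A fixed function keeps the handle -> server mapping deterministic.
    return servers[handle % len(servers)]

servers = [Server(f"srv{i}") for i in range(4)]

meta_srv = route(1001, servers)
meta_srv.kv[1001] = {"mode": 0o644, "size": 0}  # a metadata object

data_srv = route(2002, servers)
data_srv.streams[2002] = b""                     # a data object
```

Any client can recompute `route(handle, servers)` and reach the right server directly, which is what lets the request protocol avoid a central lookup step.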

Page 10: Architecture of a Next-Generation Parallel File System

Current Architecture

(diagram: client and server components)

Page 11: Architecture of a Next-Generation Parallel File System

Project Timeline

- 1994-2004: Design and development at CU under Dr. Ligon, plus ANL (CU graduates)
- 2004-2010: Primary maintenance and development by ANL (CU graduates) and the community
- 2007-2010: New PVFS branch; new development focused on a broader set of problems
- SC10 (fall 2010): OrangeFS announced with the community; now the mainline of future development as of 2.8.4
- SC11 (fall 2011): 2.8.5 + Windows client; improved metadata, stability, server-side operations; support and targeted development services initially offered by Omnibond
- Spring 2012: 2.8.6 + Webpack; performance improvements, Direct Lib + cache stability, WebDAV, S3; newer kernels, testing, Windows client stability, replicate on immutable
- Winter 2013: 2.8.7 + Webpack; performance improvements, stability
- Spring 2014: 2.8.8 + Webpack; performance improvements, stability, shared mmap, multi TCP/IP server homing, Hadoop MapReduce, user lib fixes, new spec file for RPMs + DKMS
- Available in the AWS Marketplace
- Summer 2014: 2.9.0; distributed directory metadata, capability-based security
- 2015: OrangeFS 3.0; replicated metadata and file data, 128-bit UUIDs for file handles, parallel background processes, web-based management UI, self-healing processes, data balancing

Page 12: Architecture of a Next-Generation Parallel File System

In the Code Now

Page 13: Architecture of a Next-Generation Parallel File System

Server-to-Server Communications (2.8.5)

Traditional metadata operation: a create request causes the client to communicate with all servers, O(p).

Scalable metadata operation: a create request communicates with a single server, which in turn communicates with the other servers using a tree-based protocol, O(log p).
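The O(p) versus O(log p) difference above can be illustrated with a toy message-round model. This is a sketch, not the actual request protocol, and the fanout parameter is an assumption:

```python
# Toy model of the two metadata create patterns: contacting all p servers
# directly costs p client requests, while a tree-based forward, where each
# informed server tells `fanout` new servers per round, finishes in
# O(log p) rounds.

def direct_messages(p: int) -> int:
    return p  # the client contacts every server itself: O(p)

def tree_rounds(p: int, fanout: int = 2) -> int:
    rounds, informed = 0, 1  # round 0: only the first server knows
    while informed < p:
        informed += informed * fanout  # every informed server forwards
        rounds += 1
    return rounds  # grows like log(p)

# e.g. with 64 servers: 64 direct messages vs. 6 doubling rounds (fanout=1)
```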

(diagrams: in both models the application talks to the client middleware, which talks to the servers over the network; only the server-to-server fan-out differs)

Page 14: Architecture of a Next-Generation Parallel File System

Recent Additions (2.8.5)

- SSD metadata storage
- Replicate on immutable (file based)
- Windows client
  - Supports Windows 32/64-bit
  - Server 2008, R2, Vista, 7

Page 15: Architecture of a Next-Generation Parallel File System

Direct Access Interface (2.8.6)

- Implements:
  - POSIX system calls
  - Stdio library calls
- Parallel extensions:
  - Noncontiguous I/O
  - Non-blocking I/O
- MPI-IO library: found more boundary conditions, fixed in the upcoming 2.8.7

(diagram: applications reach OrangeFS either through the kernel + PVFS lib + client core path or through the direct lib, over IB or TCP)

Page 16: Architecture of a Next-Generation Parallel File System

Direct Interface Client Caching (2.8.6)

- The Direct Interface enables multi-process coherent client caching for a single client

(diagram: a client application using the Direct Interface and a shared client cache in front of the file system)

Page 17: Architecture of a Next-Generation Parallel File System

WebDAV (2.8.6 webpack)

(diagram: Apache serves DAV in front of OrangeFS over the PVFS protocol)

- Supports the DAV protocol; tested with the Litmus DAV test suite
- Supports DAV cooperative locking in metadata

Page 18: Architecture of a Next-Generation Parallel File System

S3 (2.8.6 webpack)

(diagram: Apache serves the S3 interface in front of OrangeFS over the PVFS protocol)

- Tested using the s3cmd client
- Files accessible via other access methods
- Containers are directories
- Accounting pieces not implemented

Page 19: Architecture of a Next-Generation Parallel File System

Summary: Recently Added to OrangeFS

- In 2.8.3:
  - Server-to-server communication
  - SSD metadata storage
  - Replicate on immutable
- 2.8.4, 2.8.5 (fixes, support for newer kernels):
  - Windows client
- 2.8.6: performance, fixes, IB updates
  - Direct Access libraries (initial release)
    - Preload library for applications, including optional client cache
  - Webpack
    - WebDAV (with file locking), S3

Page 20: Architecture of a Next-Generation Parallel File System

OrangeFS on the AWS Marketplace

Available on the Amazon AWS Marketplace, brought to you by Omnibond.

(diagram: OrangeFS instances, EBS volumes, and DynamoDB forming a unified high-performance file system)

Page 21: Architecture of a Next-Generation Parallel File System

In 2.8.8 (Just Released)

Page 22: Architecture of a Next-Generation Parallel File System

Hadoop JNI Interface (2.8.8)

- OrangeFS Java Native Interface
- Extension of the Hadoop File System class via JNI
- Buffering
- Distribution
- Fast PVFS protocol for remote configuration

Page 23: Architecture of a Next-Generation Parallel File System

Additional Items (2.8.8)

- Updated user lib
- Shared mmap support in the kernel module
- Support for kernels up to 3.11
- Multi-homing servers over IP: clients can access a server over multiple interfaces (say, clients on IPoIB + clients on IPoEthernet + clients on IPoMX)
- Enterprise installers (coming shortly):
  - Client (with DKMS for the kernel module)
  - Server
  - Devel

Page 24: Architecture of a Next-Generation Parallel File System

Performance  

Page 25: Architecture of a Next-Generation Parallel File System

Scaling Tests

16 storage servers, each with two LVM'd 5+1 RAID sets, were tested with up to 32 clients; read performance reached nearly 12 GB/s and write performance nearly 8 GB/s.

Page 26: Architecture of a Next-Generation Parallel File System

MapReduce over OrangeFS

- 8 Dell R720 servers connected with 10 Gb/s Ethernet
- The remote case adds 8 additional identical servers and does all OrangeFS work remotely; only local work is done on the compute node (traditional HPC model)
- ~25% improvement with OrangeFS running remotely

Page 27: Architecture of a Next-Generation Parallel File System

MapReduce over OrangeFS

- 8 Dell R720 servers connected with 10 Gb/s Ethernet
- Remote clients are R720s with single SAS disks for local data (vs. 12-disk arrays in the previous test)

Page 28: Architecture of a Next-Generation Parallel File System

SC13 Demo Overview

(diagram: OrangeFS clients on the SC13 floor, at Clemson, USC, I2, and Omnibond, connected to 16 Dell R720 OrangeFS servers over the I2 Innovation Platform at 100 Gb/s)

Page 29: Architecture of a Next-Generation Parallel File System

SC13 WAN Performance

Multiple concurrent client file creates over the PVFS protocol (nullio).

Page 30: Architecture of a Next-Generation Parallel File System

For 2.9 (Summer 2014)

Page 31: Architecture of a Next-Generation Parallel File System

Distributed Directory Metadata (2.9.0)

(diagram: six directory entries distributed by extensible hashing across Server0 through Server3)

- State management based on GIGA+ (Garth Gibson, CMU)
- Improves access times for directories with a very large number of entries
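A toy sketch of GIGA+-style extensible hashing, where low-order bits of each entry's hash pick the server partition that holds it. Names and the bit-depth scheme here are illustrative, not the OrangeFS implementation:

```python
import hashlib

# Illustrative extensible hashing for directory entries: hash the entry
# name, and let `depth_bits` low-order bits select the partition/server.
# Doubling depth_bits splits the directory across twice as many servers.

def entry_server(name: str, depth_bits: int) -> int:
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:8], "big")
    return h & ((1 << depth_bits) - 1)  # low bits pick the partition

entries = [f"DirEnt{i}" for i in range(1, 7)]
# depth_bits=2 -> 4 partitions, matching the four-server diagram
placement = {e: entry_server(e, depth_bits=2) for e in entries}
```

Because placement is computed from the hash, any client can find the server for an entry without consulting a central directory server, which is what makes very large directories scale.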

Page 32: Architecture of a Next-Generation Parallel File System

Capability-Based Security (2.9.0)

(diagram: a cert or credential is exchanged for a signed capability via OpenSSL PKI; clients present the signed capability with each I/O request)

- 3 security modes:
  - Basic: OrangeFS/PVFS classic mode
  - Key-based: keys are used to authorize clients for use with the FS
  - User certificate based with LDAP: user certs are used for access to the file system and are generated based on LDAP uid/gid info
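The signed-capability flow can be sketched as follows. OrangeFS uses OpenSSL public-key signatures; to stay self-contained this sketch substitutes a shared-secret HMAC, and every name in it is hypothetical:

```python
import hashlib
import hmac
import json
import time

# Sketch: a trusted server issues a capability describing what a client
# may do to which object, and I/O servers verify the signature before
# honoring requests. HMAC stands in for the OpenSSL PKI signature.

SECRET = b"demo-shared-key"  # stand-in for the server's signing key

def issue_capability(handle, ops, ttl=60):
    cap = {"handle": handle, "ops": ops, "expires": time.time() + ttl}
    blob = json.dumps(cap, sort_keys=True).encode()
    sig = hmac.new(SECRET, blob, hashlib.sha256).hexdigest()
    return cap, sig

def verify_capability(cap, sig):
    blob = json.dumps(cap, sort_keys=True).encode()
    expected = hmac.new(SECRET, blob, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and cap["expires"] > time.time()

cap, sig = issue_capability(1234, ["read", "write"])
```

The point of the pattern is that I/O servers never need to consult the issuer again: the signature alone proves the client was authorized.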

Page 33: Architecture of a Next-Generation Parallel File System

For v3

Page 34: Architecture of a Next-Generation Parallel File System

Replication / Redundancy (OrangeFS 3.0)

- Redundant metadata
  - Seamless recovery after a failure
  - Redundant objects from the root directory down
  - Configurable
- Redundant data
  - Update mode (real time, on close, on immutable, none)
  - Configurable number of replicas
  - Real-time "forked flow" work shows little overhead
  - Replicate on close
  - Replicate to external (like LTFS)
  - Looking at supporting an HSM option to external (no local replica)
- Emphasis on continuous operation

Page 35: Architecture of a Next-Generation Parallel File System

Handles -> UUIDs (OrangeFS 3.0)

- An OID (object identifier) is a 128-bit UUID that is unique to the data space
- An SID (server identifier) is a 128-bit UUID that is unique to each server
- No more than one copy of a given data space can exist on any server
- The (OID, SID) tuple is unique within the file system
- (OID, SID1), (OID, SID2), (OID, SID3) are copies of the same object on different servers
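A minimal illustration of the 3.0 handle scheme using Python's `uuid` module: one OID, several SIDs, and a replica identified by the (OID, SID) pair, so the same OID may appear on several servers but at most once per server:

```python
import uuid

# Sketch of (OID, SID) handles: both are 128-bit UUIDs, and the pair
# uniquely names one replica of one object within the file system.

oid = uuid.uuid4()                        # the object's identifier
sids = [uuid.uuid4() for _ in range(3)]   # three distinct servers

# A set of (OID, SID) pairs: one copy of the object per server, max.
replicas = {(oid, sid) for sid in sids}
```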

Page 36: Architecture of a Next-Generation Parallel File System

Server Location / SID Management (OrangeFS 3.0)

- In an exascale environment with the potential for thousands of I/O servers, it will no longer be feasible for each server to know about all other servers
- Server discovery:
  - Servers will know a subset of their neighbors at startup (or may have them cached from previous startups), similar to DNS domains
  - Servers will learn about unknown servers on an as-needed basis and cache them, similar to DNS query mechanisms (root servers, authoritative domain servers)
- SID cache: an in-memory DB that stores server attributes
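The DNS-like lazy discovery described above can be sketched as a cache that resolves unknown SIDs on demand. The `DIRECTORY` lookup source below is a stand-in for whatever upstream answers discovery queries; none of these names are OrangeFS APIs:

```python
# Sketch: a server starts knowing only a few neighbors and resolves
# unknown SIDs on demand, caching the answer, like a DNS resolver.

DIRECTORY = {  # stand-in for the discovery mechanism
    "sid-a": "10.0.0.1",
    "sid-b": "10.0.0.2",
    "sid-c": "10.0.0.3",
}

class SidCache:
    def __init__(self, seed):
        self.cache = dict(seed)  # neighbors known at startup
        self.lookups = 0         # remote queries issued, for illustration

    def resolve(self, sid):
        if sid not in self.cache:  # unknown: query once and cache
            self.lookups += 1
            self.cache[sid] = DIRECTORY[sid]
        return self.cache[sid]

cache = SidCache({"sid-a": "10.0.0.1"})
```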

Page 37: Architecture of a Next-Generation Parallel File System

Policy-Based Location (OrangeFS 3.0)

- User-defined attributes for servers and clients, stored in the SID cache
- Policy is used for data location, replication location, and multi-tenant support
- Completely flexible:
  - Rack
  - Row
  - App
  - Region
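A sketch of policy-based placement as attribute filtering over the server attributes in the SID cache. The attribute names and the `select` helper are illustrative, not OrangeFS interfaces:

```python
# Sketch: servers carry user-defined attributes (rack, row, region, ...)
# and a placement policy is a set of attribute constraints used to pick
# where data or replicas may live.

servers = [
    {"sid": "s1", "rack": "r1", "region": "east"},
    {"sid": "s2", "rack": "r2", "region": "east"},
    {"sid": "s3", "rack": "r1", "region": "west"},
]

def select(servers, **policy):
    """Return SIDs of servers matching every attribute in the policy."""
    return [s["sid"] for s in servers
            if all(s.get(k) == v for k, v in policy.items())]

east_candidates = select(servers, region="east")  # placement targets
```

The same filter can express replica separation (e.g. require a different rack for each copy) or multi-tenancy (restrict a tenant to its own region).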

Page 38: Architecture of a Next-Generation Parallel File System

Background Parallel Processing Infrastructure (3.0)

- Modular infrastructure to easily build background parallel processes for the file system. Used for:
  - Gathering stats for monitoring
  - Usage calculation (can be leveraged for directory space restrictions, chargebacks)
  - Background safe FSCK processing (can mark bad items in metadata)
  - Background checksum comparisons
  - Etc.

Page 39: Architecture of a Next-Generation Parallel File System

Admin REST Interface / Admin UI (3.0)

(diagram: Apache exposes a REST interface in front of OrangeFS over the PVFS protocol)

Page 40: Architecture of a Next-Generation Parallel File System

Data Migration / Management (OrangeFS 3.x)

- Built on the redundancy and background-process infrastructure
- Migrate objects between servers:
  - De-populate a server going out of service
  - Populate a newly activated server (HW lifecycle)
  - Move computation to data
- Hierarchical storage
  - Uses existing metadata services
- Possible: directory hierarchy cloning
  - Copy on write (dev, QA, prod environments with a high percentage of data overlap)

Page 41: Architecture of a Next-Generation Parallel File System

Hierarchical Data Management (OrangeFS 3.x)

(diagram: users and HPC OrangeFS in front of metadata OrangeFS, with archive, intermediate storage, and NFS tiers, plus remote systems such as XSEDE, OSG, Lustre, GPFS, Ceph, and Gluster)

Page 42: Architecture of a Next-Generation Parallel File System

Attribute-Based Metadata Search (OrangeFS 3.x)

- Clients tag files with keys/values
- Keys/values are indexed on metadata servers
- Clients query for files based on keys/values
- Returns file handles, with options for filename and path

(diagram: key/value parallel query against metadata, then file access to the data servers)
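The tag-and-query flow above can be sketched with a simple inverted index. This is a stand-in for the metadata servers' indexes, not the OrangeFS data structure:

```python
from collections import defaultdict

# Sketch: files are tagged with key/value pairs, the pairs are indexed,
# and a query returns the handles of files matching every pair.

index = defaultdict(set)  # (key, value) -> set of file handles

def tag(handle, **kv):
    for k, v in kv.items():
        index[(k, v)].add(handle)

def query(**kv):
    sets = [index[(k, v)] for k, v in kv.items()]
    return set.intersection(*sets) if sets else set()

tag(101, project="climate", kind="checkpoint")
tag(102, project="climate", kind="viz")
```

As on the slide, a query returns file handles; resolving those handles back to filenames and paths would be a separate, optional step.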

Page 43: Architecture of a Next-Generation Parallel File System

Beyond OrangeFS NEXT

Page 44: Architecture of a Next-Generation Parallel File System

Extend Capability-Based Security

- Enable certificate-level access (in process)
- Federated access capable
- Can be integrated with rules-based access control
  - Department X in company Y can share with department Q in company Z
  - Rules and roles establish the relationship
  - Each company manages its own control of who is in the company and in the department

Page 45: Architecture of a Next-Generation Parallel File System

SDN / OpenFlow

- Working with the OpenFlow research team at CU
- OpenFlow separates the control plane from delivery, giving the ability to control the network with software
- Looking at bandwidth optimization leveraging OpenFlow and OrangeFS

Page 46: Architecture of a Next-Generation Parallel File System

ParalleX

ParalleX is a new parallel execution model. Key components are:

- Asynchronous Global Address Space (AGAS)
- Threads
- Parcels (message driven instead of message passing)
- Locality
- Percolation
- Synchronization primitives
- High Performance ParalleX (HPX) library implementation written in C++

Page 47: Architecture of a Next-Generation Parallel File System

PXFS

- Parallel I/O for ParalleX, based on PVFS
- Common themes with OrangeFS Next
- Primary objective: unification of the ParalleX and storage name spaces
  - Integration of the AGAS and storage metadata subsystems
  - Persistent object model
- Extends ParalleX with a number of I/O concepts
  - Replication
  - Metadata
- Extends I/O with ParalleX concepts
  - Moving work to data
  - Local synchronization
- Effort with LSU, Clemson, and Indiana U. (Walt Ligon, Thomas Sterling)

Page 48: Architecture of a Next-Generation Parallel File System

Community  

Page 49: Architecture of a Next-Generation Parallel File System

Johns Hopkins OrangeFS Selection

- JHU HLTCOE selected OrangeFS after evaluating Ceph, GlusterFS, Lustre, and OrangeFS

"Leveraging OrangeFS for the parallel filesystem, the system as a whole is capable of delivering 30GB/s write, 46GB/s read, and between 37,260-237,180 IOPS of performance. The variation in IOPS performance is dependent on the file size and number of bytes written per commit as documented in the Test Results section."*

"The final system design represents a 2,775% increase in read performance and a 1,763-11,759% increase in IOPS"*

* http://hltcoe.jhu.edu/uploads/publications/papers/14662_slides.pdf

Page 50: Architecture of a Next-Generation Parallel File System

Learning More

- www.orangefs.org web site: releases, documentation, wiki
- pvfs2-users@beowulf-underground.org: support for users
- pvfs2-developers@beowulf-underground.org: support for developers

Page 51: Architecture of a Next-Generation Parallel File System

Support & Development Services

- www.orangefs.com & www.omnibond.com
- Professional support & development team
- Buy into the project

Page 52: Architecture of a Next-Generation Parallel File System

Omnibond Info

Solution areas:

- Computer Vision: Intelligent Transportation Solutions
- Enterprise: Identity Manager Drivers & Sentinel Connectors; Parallel Scale-Out Storage Software
- Personal: Social Media Interaction System

Page 53: Architecture of a Next-Generation Parallel File System

Discussion