Page 1

www.clusterstor.com ClusterStor

Exascale File Systems
Scalability in ClusterStor's Colibri System

Peter Braam, Mar 15, 2010

Page 2

Content

• A few words about Lustre
• System overview
• Management
• Availability
• I/O
• Scalable communications

Page 3

Lustre  

Page 4

Lustre  

• I started this in 1999, left Sun late 2008
• A few questions:
  - How successful is Lustre?
  - Why not evolve Lustre into an exascale system?
  - What key differences to expect?

Page 5

Lustre is doing well: Top 500, June 2010

• 59/100 run Lustre
• 22/100 run GPFS
• 3/100 run Panasas
• 1/100 run CXFS
• 15 undetermined

Page 6

[Figure omitted. Source: Slide 3 of E. Barton's presentation at LUG 2010]

Page 7

Some History

• DOE: paid much, provided requirements, QA
• Version interoperability key to increasing cost
• Lack of focus on QA is the biggest regret

Year | Event                          | Release                          | Comments
1999 | Lustre Project Started @ CMU   |                                  |
2001 | Cluster File Systems founded   |                                  |
2003 |                                | Dec: Lustre 1.0                  | ~$3.8M to develop 1.0
2004 | Nov: Lustre on #1 system (BGL) | Apr: Lustre 1.2; Nov: Lustre 1.4 | ~$5.9M to get to #1
2007 | 250 paying customers, 25 OEMs  | Apr: Lustre 1.6                  | ~$10M for 1.4 to 1.6 upgrade
2009 |                                | May: Lustre 1.8                  | 2 more years of R&D for 1.6 to 1.8

Page 8

Key differences: Colibri vs. Lustre

• Focus on quality, diagnostics, usability
• Utilize flash storage
• Stop following 40-year-old UFS paradigms
  - Use embedded databases
  - Introduce powerful interfaces
• No dependence on RAID storage, failover

Page 9

System overview

Page 10

Colibri (aka C2) deployment

[Deployment diagram: 100,000s of clients; 1,000s of servers (C2 MDS and C2 DS nodes) backed by redundant storage built on 3rd-party storage (RAID arrays, JBOD, internal storage, with or without flash); flash-based C2 proxy MDS and proxy DS nodes sit between the clients and the servers.]

• Complete file service solution
• Clients have "cluster intelligence"
• Scales enormously (HPCS and beyond)

Page 11

A few elements to start

• Object store is used for everything
  - Metadata is in embedded databases
  - Schema breaks away from UFS (finally)
• PCI flash storage: integral design element
• Architecture should scale to the 100 PF range
  - After that I see too many disks again…

Page 12

Management  

Page 13

Management

• What was learned in the past? Observing a cluster is crucial:
  - FOL: file operations log
  - ADDB: analysis and diagnostics database
  - A data mining & visualization tool
• Simulations:
  - Simple language expressing I/O patterns
  - Simulator that processes the FOL traces
  - Editor for FOL for simulations
  - Server load simulations

Page 14

ADDB & FOL

FOL
• File Operation Log
• Contains FOP packets
• Transactional
• Complete
• Useful for data mgmt

ADDB
• One ADDB record for each FOL record
• Contains analysis & diagnostics data
• E.g.: which client did what? How long did it take? What network?
• Cache hits / disk queues
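To make the FOL-ADDB pairing concrete, here is a minimal sketch in Python (the field names are illustrative assumptions, not Colibri's actual schema): each transactional file operation packet (FOP) recorded in the FOL yields exactly one ADDB record holding the analysis and diagnostics data.

```python
from dataclasses import dataclass

@dataclass
class FolRecord:
    """One transactional file operation packet (FOP) as logged in the FOL."""
    lsn: int            # log sequence number
    client_id: str      # which client issued the operation
    op: str             # e.g. "create", "write", "getattr"
    fid: int            # file identifier the operation touched
    args: dict          # operation-specific arguments (offset, length, ...)

@dataclass
class AddbRecord:
    """Analysis & diagnostics data derived from exactly one FOL record."""
    lsn: int            # links back to the FOL record
    client_id: str
    network: str        # which network the request arrived on
    elapsed_s: float    # how long the operation took
    cache_hit: bool
    disk_queue_depth: int

def addb_from_fol(rec: FolRecord, start: float, end: float,
                  network: str, cache_hit: bool, qdepth: int) -> AddbRecord:
    """Produce the one ADDB record that corresponds to a FOL record."""
    return AddbRecord(rec.lsn, rec.client_id, network, end - start,
                      cache_hit, qdepth)
```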

Page 15

Example - Simulation Environment

[Diagram: jobs run against clients and a Colibri server, producing FOL + ADDB traces; the traces are searched and analyzed, then replayed in CSIM as "simulated original" and "simulated modified" runs; the tool visualizes the difference between original and modified, and real performance vs. modeled performance.]
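As a rough illustration of what a replay-driven simulator in this environment does (a sketch under my own assumptions, not CSIM itself): walk a FOL trace, apply an editable per-operation cost model, and compare the modeled run time of an "original" and a "modified" configuration.

```python
# Toy trace-replay simulator in the spirit of the CSIM flow above.
# The cost model and trace format are illustrative assumptions.

def simulate(fol_trace, cost_model):
    """Replay a FOL trace against a per-operation cost model.

    fol_trace  : iterable of (op, nbytes) tuples extracted from the FOL
    cost_model : dict mapping op -> (fixed_latency_s, bytes_per_s)
    Returns the modeled total run time in seconds.
    """
    t = 0.0
    for op, nbytes in fol_trace:
        latency, bw = cost_model[op]
        t += latency + (nbytes / bw if bw else 0.0)
    return t

# "Original" system vs. a "modified" one (e.g. data pre-staged in flash).
trace = [("create", 0), ("write", 4 << 20)] * 1000 + [("read", 4 << 20)] * 1000
disk_model  = {"create": (2e-4, 0), "write": (5e-3, 80e6),  "read": (8e-3, 80e6)}
flash_model = {"create": (2e-4, 0), "write": (1e-4, 500e6), "read": (1e-4, 500e6)}

print("modeled original :", simulate(trace, disk_model), "s")
print("modeled modified :", simulate(trace, flash_model), "s")
```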

Page 16

Failures & availability

Page 17

No failover

• Failures will be common in very large systems
• Failover
  - Wait for resource
  - Doesn't work well
• Focus on availability
  - No reply (failure, load)
  - Adapt layout
  - Asynchronous cleanup


[Diagram: a client writing to data servers DS1-DS3. 1. The client sees a failure (no reply from one DS). 2. The client changes the layout (a new layout via the MDS). 3. The client writes using the new layout; the lost redundancy may be rebuilt asynchronously.]
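A minimal sketch of the client-side logic this flow implies (the RPC helpers and layout representation are assumptions): a missing reply is treated as a signal to adapt the layout and keep writing, rather than waiting for failover; rebuilding redundancy is left to run asynchronously.

```python
# Sketch of the "no failover" write path: adapt the layout, don't wait.
import random

class NoReply(Exception):
    """Raised when a data server does not answer (failure or overload)."""

def send_write(server, extent, data):
    # Stand-in for the real RPC; a dead server simply never replies.
    if server in DEAD_SERVERS:
        raise NoReply(server)

def request_new_layout(mds, old_layout, bad_server):
    # The MDS hands out an adapted layout that avoids the unresponsive server.
    spare = random.choice([s for s in mds["servers"] if s not in old_layout])
    return [spare if s == bad_server else s for s in old_layout]

def resilient_write(mds, layout, extent, data):
    while True:
        try:
            for server in layout:
                send_write(server, extent, data)
            return layout                     # success, possibly on a new layout
        except NoReply as e:
            layout = request_new_layout(mds, layout, e.args[0])   # step 2
            # Step 3 happens on the next loop iteration; asynchronous rebuild
            # of redundancy for the failed server is left to the servers.

DEAD_SERVERS = {"DS2"}
mds = {"servers": ["DS1", "DS2", "DS3", "DS4", "DS5"]}
print(resilient_write(mds, ["DS1", "DS2", "DS3"], (0, 1 << 20), b"x"))
```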

Page 18

Metadata - Data Layout

• Almost all layouts will be regular
  - Use formulas, do not enumerate objects & devices
  - Use references between metadata tables
  - Avoid multiple copies
• After failures, data layout becomes complex
  - Failures can move 1000s of different extents in a file
  - May clean this up asynchronously
  - FOL knows about it
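"Use formulas, do not enumerate" can be read as: for a regular layout, the mapping from file offset to device and object offset is computed from a few parameters rather than stored per object. A minimal sketch of such a parameterized striping formula (parameter names are illustrative):

```python
def locate(offset, stripe_unit, devices):
    """Map a file offset to (device, object offset) by formula alone.

    A regular layout is fully described by a few parameters, so no
    per-object or per-device enumeration has to be stored in metadata.
    """
    stripe_index = offset // stripe_unit           # which stripe unit
    device = devices[stripe_index % len(devices)]  # round-robin over devices
    stripe_row = stripe_index // len(devices)      # full rows preceding it
    obj_offset = stripe_row * stripe_unit + offset % stripe_unit
    return device, obj_offset

# 4 MB stripe units over three data servers
print(locate(10 << 20, 4 << 20, ["DS1", "DS2", "DS3"]))   # ('DS3', 2097152)
```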

Page 19

Data placement

• Data layout is central in the architecture
  - Redundancy across the servers in the network
  - RAID in data servers has little value
• Parity de-clustered in the network
  - All drives contain some data of objects
  - All drives contain some spare space
• 1-5 minute restore of redundancy for a failed drive, with the bandwidth of the cluster (estimated below)
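A back-of-the-envelope estimate of the declustered rebuild time referenced above (the per-drive figures are borrowed from the 100 PF sizing later in the deck, and the parity-group width is an assumption): because every drive holds a little of the lost data and a little spare space, the rebuild proceeds at the bandwidth of the whole cluster rather than of one spare drive.

```python
# Back-of-the-envelope declustered-rebuild time. Drive figures are taken from
# the 100 PF sizing later in this deck; the parity-group width is assumed.
drive_capacity = 8e12        # bytes lost when one drive fails
drive_bw       = 80e6        # streaming bytes/s per surviving drive
drives         = 20_000      # all drives share the rebuild work
parity_width   = 10          # data units read per lost unit (assumed)

# Every surviving drive reads a small share of the needed parity-group data
# and writes a small share of the reconstructed data into its spare space.
bytes_per_drive = (parity_width + 1) * drive_capacity / (drives - 1)
print(f"~{bytes_per_drive / drive_bw / 60:.1f} min")   # roughly a minute
```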

Page 20

Availability

• System at full bandwidth 99.97% of the time
  - Disk rebuilds are not the problem
  - Server losses are a problem
• Systems must have network redundancy
  - AKA server network striping
• No purpose for shared storage

Page 21

Caches and scale

Page 22

Flash, cache & scale

• {Flash, Disk} based servers = {FBS, DBS}
• 2010:
  - Bandwidth: FBS $/GB/sec == 0.25 x DBS $/GB/sec
  - Capacity: FBS $/GB == 5x DBS $/GB
  - FBS form factor -> PCI flash becomes commodity
• Choose FBS for bandwidth, with capacity ~ RAM of the cluster
  - FBS bandwidth high enough for e.g. a fast checkpoint
  - DBS bandwidth = 0.2 x FBS bandwidth
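The ratios above suggest the obvious split: buy flash for bandwidth and disk for capacity. A toy cost model (only the ratios come from the slide; the absolute dollar figures are invented for illustration):

```python
# Illustrative cost split between flash-based (FBS) and disk-based (DBS)
# servers, using only the ratios above; absolute prices are made up.
DBS_COST_PER_GBPS = 4000.0               # $ per GB/s of disk bandwidth (assumed)
FBS_COST_PER_GBPS = 0.25 * DBS_COST_PER_GBPS
DBS_COST_PER_GB   = 0.10                 # $ per GB of disk capacity (assumed)
FBS_COST_PER_GB   = 5 * DBS_COST_PER_GB

def cost(bandwidth_gbps, capacity_gb):
    """Cheapest mix per the slide: flash for bandwidth, disk for capacity."""
    flash_tier = bandwidth_gbps * FBS_COST_PER_GBPS   # sized for checkpoints
    disk_tier  = capacity_gb * DBS_COST_PER_GB        # sized for bulk data
    return flash_tier + disk_tier

# e.g. 10 TB/s of burst bandwidth plus 100 PB of capacity
print(f"${cost(10_000, 100e6):,.0f}")
```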

Page 23

Containers – forwarding caches

• Containers are like a logical volume
• Can contain:
  - Metadata objects
  - File fragments
  - Files
• New feature – "include" operation (sketched below)
  - A container moves into a larger container
  - "include" does not iterate over contents
• Metadata is more complex
  - Not problematic, we use an embedded DB anyway
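A toy illustration of the "include" operation (the data structure is an assumption, not Colibri's container format): including one container in another records a reference in constant time, and lookups follow the references, so the contents are never iterated at include time.

```python
class Container:
    def __init__(self, name):
        self.name = name
        self.objects = {}        # directly stored objects
        self.included = []       # containers absorbed by reference

    def include(self, other):
        """O(1) 'include': record a reference, never iterate other's contents."""
        self.included.append(other)

    def lookup(self, key):
        if key in self.objects:
            return self.objects[key]
        for sub in self.included:
            found = sub.lookup(key)
            if found is not None:
                return found
        return None

big = Container("flash-cache")
small = Container("client-wb-cache")
small.objects["inode:42"] = b"dirty data"
big.include(small)                       # constant time, regardless of contents
print(big.lookup("inode:42"))            # b'dirty data'
```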

Page 24

Container streaming

[Diagram: 1. The client gets a range of unallocated inodes from the MDS. 2. The client creates files and writes to them using a local container. 3. The client streams the container to the data servers as a single block of data.]
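A rough sketch of the streaming flow above (the APIs are invented for illustration): the client batches many small creates and writes into a local container using pre-allocated inodes, then ships the whole container to a data server in one transfer instead of one RPC per file.

```python
# Sketch of client-side container streaming: batch small files locally,
# then push the whole container to a data server as one block.
import json

class LocalContainer:
    def __init__(self, inode_range):
        self.free_inodes = list(inode_range)   # pre-allocated by the MDS
        self.files = {}                        # inode -> bytes

    def create_and_write(self, data: bytes) -> int:
        inode = self.free_inodes.pop(0)        # no MDS round trip per file
        self.files[inode] = data
        return inode

    def stream(self) -> bytes:
        """Serialize the whole container into a single block of data."""
        blob = {str(i): d.decode("latin-1") for i, d in self.files.items()}
        return json.dumps(blob).encode("latin-1")

# 1. Client gets a range of unallocated inodes from the MDS (assumed RPC).
container = LocalContainer(inode_range=range(1000, 2000))
# 2. Client creates files and writes to them using the local container.
for k in range(100):
    container.create_and_write(f"checkpoint shard {k}".encode())
# 3. Client streams the container to a data server as one block.
block = container.stream()
print(len(block), "bytes sent in a single transfer")
```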

Page 25

Several uses for containers

• WB caches in clients
  - Advantage: no iteration over contents
• Networked caches to act as an I/O capacitor
  - Get flash for bandwidth, capacity ~ RAM of the entire cluster
  - Get disk for capacity
• Intermediary storage, e.g. I/O forwarding
  - Physically non-contiguous container with small stuff
  - Streams to a contiguous container inside a bigger one
• Data management

Page 26

Oh, you need to read?

• Physics: disks are slow
• Two common cases:
  1. Everyone reads the same
  2. Everyone reads something different
• Case 2 is best handled as follows:
  - Use the FOL or otherwise track what files will be needed
  - Pre-stage them in flash (sketched below)
• Case 1 is addressed with scalable comms
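A minimal sketch of case 2 (the prediction heuristic and staging call are assumptions): use FOL history, or a job's declared inputs, to guess which files the next job will read, and warm them into the flash tier before the job starts.

```python
# Sketch of FOL-driven pre-staging: predict the next job's input files and
# warm them into flash before the compute nodes start reading.
from collections import Counter

def predict_inputs(fol_records, job_tag, top_n=1000):
    """Pick the files a job family touches most often in the FOL history."""
    hits = Counter(r["fid"] for r in fol_records
                   if r["op"] == "read" and r["job"] == job_tag)
    return [fid for fid, _ in hits.most_common(top_n)]

def prestage(fids, flash_tier):
    for fid in fids:
        flash_tier.add(fid)        # stand-in for a real migrate-to-flash call

fol = [{"op": "read", "fid": f, "job": "climate"} for f in (1, 2, 2, 3, 3, 3)]
flash = set()
prestage(predict_inputs(fol, "climate", top_n=2), flash)
print(flash)                        # the hottest inputs (fids 2 and 3) are now in flash
```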

Page 27

Scalable communications

Page 28

Colibri resources

• Colibri manages resources
  - Locks, file layouts, grants, quota, credits, ...
  - Even inodes and data are resources
• Resource communications need to scale
  - E.g. a million processes may need to agree on an adapted file layout
  - All nodes need to read the same data
• Hierarchical flood-fill mechanism (not new; a toy version is sketched below)
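A toy version of such a hierarchical fan-out (the tree shape and message type are illustrative): the resource manager tells a small set of proxies, each proxy tells its children, and so on, so a million clients can receive an adapted layout with bounded fan-out per node.

```python
# Toy hierarchical flood-fill: propagate one resource update (e.g. an adapted
# file layout) down a proxy tree instead of messaging every client directly.
def flood_fill(node, update, children_of, deliver):
    deliver(node, update)                      # this node applies the update
    for child in children_of(node):            # ...then forwards it downward
        flood_fill(child, update, children_of, deliver)

# Build a fan-out-16 tree over 1,000 nodes: node k's children are 16k+1..16k+16.
N, FANOUT = 1000, 16
children_of = lambda k: range(FANOUT * k + 1, min(FANOUT * k + FANOUT + 1, N))

received = []
flood_fill(0, {"layout": "v2"}, children_of, lambda n, u: received.append(n))
print(len(received), "nodes updated")          # 1000 nodes, fan-out 16 per node
```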

Page 29

Architecture decomposition

[Block diagram of a C2 server: FOPs arrive from C2 clients over wire protocols at the network request scheduler and request handler; internal modules cover resources, object containers, and cache pressure; API calls dispatch to the metadata back-end, data back-end, and remote back-end, and FOPs can be forwarded to other servers. A legend distinguishes wire protocols from API calls.]

Page 30

Scalable communications

[Diagram contrasting the physical organization of the C2 cluster (data servers and metadata servers connected by a high-speed network) with the logical organization used for resource management.]

Page 31

Use case example – HDF5 file I/O @ 100 PF

Page 32

Lessons from benchmarking

• 1 TB FATSAS drives (Seagate Barracuda)
  - 80 MB/sec bandwidth with cache off
  - 4 MB allocation unit is optimal
• PCI flash and NFSv4.1 RPC system
  - 10 GigE IPoIB connection
  - DB4 backend
  - 100K transactions/sec aggregate, sustained
    - Updating 2 tables and using a transaction log
    - One server

Page 33

100 PF system

• Requirements:
  - 100 PB capacity
  - 1 trillion files
  - 10 TB/sec I/O throughput
  - Checkpoint or result dumps of 3 PB
  - 10M MDS operations/sec

Page 34

100 PF Solution

• 500 servers, each acting as MDS and DS
  - Disk capacity: 500 x 8 TB x 40 drives = 160 PB raw
  - BW ~ 20,000 x 80 MB/sec = 1.6 TB/sec
• PCI flash
  - Capacity: 500 x 6 TB = 3 PB
  - BW/node: 25 GB/sec; aggregate: 12.5 TB/sec
• Network: 4x EDR IB – effective BW 25 GB/sec
• MD throughput aggregate: 50M transactions/sec
• 1 copy of MD remains in flash
  - 10^12 inodes x 150 B = 150 TB, or 5% of flash
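A quick arithmetic check of this sizing against the requirements on the previous slide, using only numbers already quoted in the deck:

```python
# Sanity-check the 100 PF sizing against the stated requirements.
servers = 500
drives  = servers * 40

raw_disk  = drives * 8e12                  # vs. 100 PB required
flash_cap = servers * 6e12                 # 3 PB: one full checkpoint dump
flash_bw  = servers * 25e9                 # vs. 10 TB/s required
md_rate   = servers * 100e3                # 100K trans/s per server (benchmark)
md_foot   = 1e12 * 150                     # 150 TB of inodes

print(raw_disk / 1e15, "PB raw disk")                   # 160.0
print(flash_bw / 1e12, "TB/s flash BW")                 # 12.5
print(3e15 / flash_bw, "s to absorb a 3 PB checkpoint") # 240.0
print(md_rate / 1e6, "M MDS ops/s")                     # 50.0 vs. 10M required
print(md_foot / flash_cap * 100, "% of flash holds all metadata")  # 5.0
```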

Page 35

HDF5 file I/O – use case

• Servers detect ongoing small I/O on part of a file
• The server chooses to migrate a section of the file, plus the file allocation data, into a container stored on flash
  - This involves revoking the file layout from all clients and sending out the new layout (scalable comms)
• During the migration, small I/O stops briefly
• Now 100K IOPS are available to flash
• When the file is quiescent, the data migrates back
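A schematic version of that policy (the thresholds and the revoke/migrate/broadcast calls are all assumptions; the deck only describes the behavior): the server counts small I/Os per file region, and once a region is clearly hot it revokes the layout, migrates the region into a flash container, and publishes the new layout.

```python
# Schematic server-side policy for the HDF5 small-I/O use case.
from collections import defaultdict

SMALL_IO = 64 * 1024          # "small" means under 64 KiB (assumed threshold)
TRIGGER  = 10_000             # sustained count before migrating (assumed)

small_io_counts = defaultdict(int)           # (fid, extent) -> small ops seen

def on_io(fid, extent, nbytes, cluster):
    if nbytes < SMALL_IO:
        small_io_counts[(fid, extent)] += 1
        if small_io_counts[(fid, extent)] == TRIGGER:
            migrate_to_flash(fid, extent, cluster)

def migrate_to_flash(fid, extent, cluster):
    cluster.revoke_layout(fid)               # small I/O pauses briefly here
    cluster.copy_extent_to_flash_container(fid, extent)
    cluster.broadcast_new_layout(fid)        # via the scalable comms tree
    # ...and once the file goes quiescent, a reverse migration is scheduled.
```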

Page 36

About ClusterStor

Page 37

A few other notes…

• ClusterStor market focus is "large data"
• Colibri will be delivered in increments
  - The 4th increment is the exascale file system
  - The 2nd is a pNFS server
  - The 0th, 1st, 3rd and 5th are a secret

Page 38

Thank you!
