
What's New and Upcoming in HDFS - the Hadoop Distributed File System


What’s new and upcoming in HDFS
January 30, 2013
Todd Lipcon, Software Engineer
[email protected] @tlipcon

Introductions

• Software engineer on Cloudera’s Storage Engineering team

• Committer and PMC Member for Apache Hadoop and Apache HBase

• Projects in 2012 • Responsible for >50% of the code for all phases of HA development

• Also worked on many performance and stability improvements

• This presentation is highly technical – please feel free to grab/email me later if you’d like to clarify anything!


Outline  


• HDFS 2.0 – what’s new in 2012? • HA Phase 1 (Q1 2012) • HA Phase 2 (Q2-Q4 2012) • Performance improvements and other new features

• What’s coming in 2013? • HDFS Snapshots • Better storage density and file formats • Caching and Hierarchical Storage Management


HDFS-1623: completed March 2012

HDFS HA Phase 1 Review

HDFS HA Background

• HDFS’s strength is its simple and robust design • Single master NameNode maintains all metadata • Scales to multi-petabyte clusters easily on modern hardware

• Traditionally, the single master was also a single point of failure

• Generally good availability, but not ops-friendly • No hot patching, no hot reconfiguration • No hot hardware replacement

• Hadoop is now mission critical: SPOF not OK!


HDFS HA Development Phase 1

• Completed March 2012 (HDFS-1623) • Introduced the StandbyNode, a hot backup for the HDFS NameNode

• Relied on shared storage to synchronize namespace state • (e.g. a NAS filer appliance)

• Allowed operators to manually trigger failover to the Standby

• Sufficient for many HA use cases: avoided planned downtime for hardware and software upgrades, planned machine/OS maintenance, configuration changes, etc.


HDFS HA Architecture Phase 1

• Parallel block reports sent to Active and Standby NameNodes

• NameNode state shared by locating the edit log on NAS over NFS

• Active NameNode writes while the Standby Node “tails”

• Client failover done via client configuration • Each client configured with the address of both NNs: try both to find the active
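For illustration, here is how a client might be configured against an HA nameservice using the standard HA configuration keys. This is a sketch: the nameservice name and hostnames are made-up examples, and in practice these keys normally live in hdfs-site.xml rather than in code.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HaClientExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical nameservice; clients address it instead of a single NN host.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        // RPC address of each NameNode (hostnames here are hypothetical).
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // Failover proxy provider: tries both NNs and sticks with the active one.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Clients talk to the nameservice, not to an individual NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println(fs.exists(new Path("/")));
      }
    }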


HDFS HA Architecture Phase 1


Fencing and NFS

• Must avoid split-brain syndrome • Both nodes think they are active and try to write to the same edit log. Your metadata becomes corrupt and requires manual intervention to restart

• Configure a fencing script • Script must ensure that the prior active has stopped writing • STONITH: shoot-the-other-node-in-the-head • Storage fencing: e.g. using the NetApp ONTAP API to restrict filer access

• Fencing script must succeed to have a successful failover


Shortcomings of Phase 1

• Insufficient to protect against unplanned downtime • Manual failover only: requires an operator to step in quickly after a crash

• Various studies indicated this was the minority of downtime, but still important to address

• Requirement of a NAS device made deployment complex, expensive, and error-prone

(we always knew this was just the first phase!)


HDFS HA Development Phase 2

• Multiple new features for high availability • Automatic failover, based on Apache ZooKeeper • Remove dependency on NAS (network-attached storage)

• Address new HA use cases • Avoid unplanned downtime due to software or hardware faults

• Deploy in filer-less environments • Completely stand-alone HA with no external hardware or software dependencies

• no Linux-HA, filers, etc.



HDFS-3042: completed May 2012

Automatic Failover Overview

Automatic Failover Goals

• Automatically detect failure of the Active NameNode • Hardware, software, network, etc.

• Do not require operator intervention to initiate failover

• Once failure is detected, the process completes automatically • Support manually initiated failover as first-class

• Operators can still trigger failover without having to stop the Active

• Do not introduce a new SPOF • All parts of the auto-failover deployment must themselves be HA


Automatic Failover Architecture


• Automatic failover requires ZooKeeper • Not required for manual failover

• ZK makes it easy to: • Detect failure of the Active NameNode • Determine which NameNode should become the Active NN

Automatic Failover Architecture


• New daemon: ZooKeeper Failover Controller (ZKFC)

• In an auto-failover deployment, run two ZKFCs • One per NameNode, on that NameNode machine

• ZKFC has three simple responsibilities: • Monitors health of its associated NameNode • Participates in leader election of NameNodes • Fences the other NameNode if it wins the election
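The leader election relies on a standard ZooKeeper pattern: whichever ZKFC creates an ephemeral lock znode first wins, and the ephemeral node disappears if that ZKFC’s session dies. The sketch below illustrates only that pattern; it is not the real ZKFC/ActiveStandbyElector code, and the znode path, session handling, and retries are simplified assumptions.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    // Illustrative sketch of ephemeral-znode leader election; not the actual ZKFC.
    public class ElectionSketch implements Watcher {
      private static final String LOCK_PATH = "/ha-demo/lock"; // hypothetical path
      private final ZooKeeper zk;

      public ElectionSketch(String zkQuorum) throws Exception {
        this.zk = new ZooKeeper(zkQuorum, 5000, this);
      }

      /** Try to become active by creating an ephemeral lock znode. */
      public boolean tryBecomeActive(byte[] myNameNodeInfo) throws Exception {
        try {
          zk.create(LOCK_PATH, myNameNodeInfo, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          return true;   // we won: tell our local NameNode to transition to active
        } catch (KeeperException.NodeExistsException e) {
          zk.exists(LOCK_PATH, true); // watch the holder; re-run the election when it vanishes
          return false;  // another ZKFC's NameNode is active; stay standby
        }
      }

      @Override
      public void process(WatchedEvent event) {
        // When the lock znode disappears (the active NN crashed or its ZK session
        // expired), this watcher fires and the ZKFC would re-attempt the election.
      }
    }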

Automatic Failover Architecture



HDFS-3077: completed October 2012

Removing the NAS dependency

Shared Storage in HDFS HA

• The Standby NameNode synchronizes the namespace by following the Active NameNode’s transaction log

• Each operation (e.g. mkdir(/foo)) is written to the log by the Active

• The StandbyNode periodically reads all new edits and applies them to its own metadata structures

• Reliable shared storage is required for correct operation

• In phase 1, shared storage was synonymous with NFS-mounted NAS


Shortcomings of the NFS-based approach

• Custom hardware • Lots of our customers don’t have SAN/NAS available in their datacenters

• Costs money, time and expertise • Extra “stuff” to monitor outside HDFS • We just moved the SPOF, didn’t eliminate it!

• Complicated • Storage fencing, NFS mount options, multipath networking, etc. • Organizationally complicated: dependencies on the storage ops team

• NFS issues • Buggy client implementations, little control over timeout behavior, etc.


Primary Requirements for Improved Storage

• No special hardware (PDUs, NAS) • No custom fencing configuration

• Too complicated == too easy to misconfigure

• No SPOFs • punting to filers isn’t a good option • need something inherently distributed


Secondary Requirements

• Configurable degree of fault tolerance • Configure N nodes to tolerate (N-1)/2 failures (see the sketch below)

• Making N bigger (within reasonable bounds) shouldn’t hurt performance. Implies:

• Writes done in parallel, not pipelined • Writes should not wait on the slowest replica

• Locate replicas on existing hardware investment (e.g. share with JobTracker, NN, SBN)
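To make the fault-tolerance arithmetic concrete: a quorum of N journal replicas tolerates floor((N-1)/2) failures, which is why 3 replicas survive 1 failure and 5 survive 2. A tiny illustration:

    /** Failures a quorum of size n can tolerate: floor((n-1)/2). */
    public final class QuorumMath {
      static int tolerableFailures(int n) {
        return (n - 1) / 2;                       // integer division == floor for n >= 1
      }
      public static void main(String[] args) {
        System.out.println(tolerableFailures(3)); // 1: the typical three-node setup
        System.out.println(tolerableFailures(5)); // 2: five nodes survive two failures
      }
    }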


Operational Requirements

• Should be operable by existing Hadoop admins. Implies:

• Same metrics system (“hadoop metrics”) • Same configuration system (xml) • Same logging infrastructure (log4j) • Same security system (Kerberos-based)

• Allow existing ops to easily deploy and manage the new feature

• Allow existing Hadoop tools to monitor the feature • (e.g. Cloudera Manager, Ganglia, etc.)


Our solution: QuorumJournalManager

• QuorumJournalManager (client) • Plugs into the JournalManager abstraction in the NN (instead of the existing FileJournalManager)

• Provides an edit log storage abstraction

• JournalNode (server) • Standalone daemon running on an odd number of nodes • Provides actual storage of edit logs on local disks • Could run inside other daemons in the future

 


Architecture  


Commit protocol

• NameNode accumulates edits locally as they are logged

• On logSync(), sends the accumulated batch to all JNs via Hadoop RPC

• Waits for a success ACK from a majority of nodes • Majority commit means that a single lagging or crashed replica does not impact NN latency (see the sketch below)

• Latency @ NN = median(Latency @ JNs)

• Uses the well-known Paxos algorithm to perform recovery of any in-flight edits on leader switchover
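A minimal sketch of the majority-ACK idea behind logSync(), illustrative only: the JournalNode calls are simulated with Callables here, whereas the real QuorumJournalManager uses Hadoop RPC and its own per-journal response futures.

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class QuorumCommitSketch {
      /** Send one edit batch to every journal in parallel; succeed once a majority ACK. */
      static boolean quorumCommit(ExecutorService pool, List<Callable<Boolean>> journalCalls,
                                  long timeoutMs) throws InterruptedException {
        int majority = journalCalls.size() / 2 + 1;
        CompletionService<Boolean> acks = new ExecutorCompletionService<>(pool);
        for (Callable<Boolean> call : journalCalls) {
          acks.submit(call);                       // writes go out in parallel, not pipelined
        }
        int ok = 0;
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        for (int i = 0; i < journalCalls.size() && ok < majority; i++) {
          Future<Boolean> f = acks.poll(deadline - System.nanoTime(), TimeUnit.NANOSECONDS);
          if (f == null) break;                    // ran out of time waiting for ACKs
          try {
            if (f.get()) ok++;                     // count successful ACKs only
          } catch (ExecutionException e) {
            // one lagging or crashed journal does not block the commit
          }
        }
        return ok >= majority;                     // latency tracks the median journal, not the slowest
      }

      public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        // Three simulated JournalNodes: two fast, one hung.
        Callable<Boolean> fast = () -> true;
        Callable<Boolean> slow = () -> { Thread.sleep(10_000); return true; };
        List<Callable<Boolean>> jns = List.of(fast, fast, slow);
        System.out.println(quorumCommit(pool, jns, 1000));  // true: 2 of 3 ACKed
        pool.shutdownNow();
      }
    }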


JN Fencing

• How do we prevent split-brain? • Each instance of QJM is assigned a unique epoch number

• provides a strong ordering between client NNs • Each IPC contains the client’s epoch • The JN remembers on disk the highest epoch it has seen • Any request from an earlier epoch is rejected. Any from a newer one is recorded on disk

• Distributed-systems folks may recognize this technique from Paxos and other literature
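An illustrative sketch of the epoch check on the journal side. This is a toy version: the real JournalNode persists the promised epoch to disk and enforces it on every RPC.

    public class EpochFencingSketch {
      private long lastPromisedEpoch = 0;

      /** A new writer (NameNode) must claim a strictly higher epoch before writing. */
      public synchronized void newEpoch(long epoch) {
        if (epoch <= lastPromisedEpoch) {
          throw new IllegalStateException(
              "epoch " + epoch + " <= promised " + lastPromisedEpoch);
        }
        lastPromisedEpoch = epoch;   // a real JournalNode would also fsync this to disk
      }

      /** Every journal request carries the caller's epoch; stale writers are rejected. */
      public synchronized void journal(long epoch, byte[] edits) {
        if (epoch < lastPromisedEpoch) {
          // A NameNode superseded by a newer active lands here: it has been fenced out.
          throw new IllegalStateException(
              "fenced: epoch " + epoch + " < promised " + lastPromisedEpoch);
        }
        // ... append the edits to the local log ...
      }

      public static void main(String[] args) {
        EpochFencingSketch jn = new EpochFencingSketch();
        jn.newEpoch(1);
        jn.journal(1, new byte[0]);   // the original active writes normally
        jn.newEpoch(2);               // a new active takes over with a higher epoch
        try {
          jn.journal(1, new byte[0]); // the old active is now rejected automatically
        } catch (IllegalStateException e) {
          System.out.println(e.getMessage());
        }
      }
    }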


Fencing with epochs

• Fencing is now implicit • The act of becoming active causes any earlier active NN to be fenced out

• Since a quorum of nodes has accepted the new active, any other IPC by an earlier epoch number can’t get a quorum

• Eliminates confusing and error-prone custom fencing configuration


Other implementation features

• Hadoop Metrics • lag, percentile latencies, etc. from the perspective of JN and NN • metrics for queued txns, % of time each JN fell behind, etc., to help suss out a slow JN before it causes problems

• Security • full Kerberos and SSL support: edits can be optionally encrypted in-flight, and all access is mutually authenticated


Testing

• Randomized fault test • Runs all communications in a single thread with deterministic order and fault injections based on a seed

• Caught a number of really subtle bugs along the way • Run as an MR job: 5000 fault tests in parallel • Multiple CPU-years of stress testing: found 2 bugs in Jetty!

• Cluster testing: 100-node, MR, HBase, Hive, etc. • Commit latency in practice: within the same range as local disks (better than one of two local disks, worse than the other one)


Deployment  

• Most customers running 3 JNs (tolerate 1 failure) • 1 on NN, 1 on SBN, 1 on JobTracker/ResourceManager • Optionally run 2 more (e.g. on bastion/gateway nodes) to tolerate 2 failures

• No new hardware investment

• Refer to the docs for detailed configuration info


Status  

• Merged into the Hadoop development trunk in early October

• Available in CDH4.1, will be in the upcoming Hadoop 2.1 • Deployed at several customer/community sites with good success so far (no lost data)

• In contrast, we’ve had several issues with misconfigured NFS filers causing downtime

• Highly recommend you use Quorum Journaling instead of NFS!


Summary of HA Improvements

• Run an active NameNode and a hot Standby NameNode

• Automatically triggers seamless failover using Apache ZooKeeper

• Stores shared metadata on QuorumJournalManager: a fully distributed, redundant, low-latency journaling system

• All improvements available now in HDFS branch-2 and CDH4.1



HDFS Performance Update

Performance Improvements (overview)

• Several improvements made for Impala • Much faster libhdfs • APIs for spindle-based scheduling

• Other more general improvements (especially for HBase and Accumulo)

• Ability to read directly from block files in secure environments

• Ability for applications to perform their own checksums and eliminate IOPS


libhdfs “direct read” support (HDFS-2834)

• This can also benefit apps like HBase, Accumulo, and MR with a bit more work (TBD in 2013)


Disk locations API (HDFS-3672)

• HDFS has always exposed node locality information • Map<Block, List<Datanode Addresses>>

• Now it can also expose disk locality information • Map<Replica, List<Spindle Identifiers>> (see the sketch below)

• Impala uses this API to keep all disks spinning at full throughput

• ~2x improvement on IO-bound workloads on 12-spindle machines
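For reference, the long-standing node-level locality API looks like the sketch below (the file path is hypothetical). The per-volume variant added by HDFS-3672 follows the same shape but maps each replica to a volume/spindle identifier on DistributedFileSystem; its exact method signature depends on the release, so only the node-level call is shown.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/example/data.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(p);

        // Node-level locality: which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset=" + b.getOffset() + " len=" + b.getLength()
              + " hosts=" + String.join(",", b.getHosts()));
        }
      }
    }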


Short-circuit reads

• “Short circuit” allows HDFS clients to open HDFS block files directly from the local filesystem

• Avoids context switches and trips back and forth from user space to kernel space memory, TCP stack, etc.

• Uses 50% less CPU, avoids significant latency when reading data from the Linux buffer cache

• Sequential IO performance: 2x improvement • Random IO performance: 3.5x improvement

• This has existed for a while in insecure setups only! • Clients need read access to all block files :(


Secure short-circuit reads (HDFS-347)

• DataNode continues to arbitrate access to block files • Opens input streams and passes them to the DFS client after authentication and authorization checks

• Uses a trick involving Unix domain sockets (sendmsg with SCM_RIGHTS)

• Now perf-sensitive apps like HBase, Accumulo, and Impala can safely configure this feature in all environments
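As a rough guide, enabling this style of short-circuit read is a client/DataNode configuration change along the lines of the sketch below. The key names shown are the ones used by Hadoop 2.x releases that include HDFS-347, and the socket path is only an example; treat the details as assumptions and check the docs for your version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShortCircuitClient {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enable short-circuit local reads on the client.
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        // Unix domain socket shared with the local DataNode; the path is an example
        // and must match the DataNode's own configuration.
        conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");

        FileSystem fs = FileSystem.get(conf);
        // Reads of local replicas now bypass the DataNode's TCP data path, while the
        // DataNode still performs the authentication and authorization checks above.
        System.out.println(fs.exists(new Path("/")));
      }
    }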


Checksum skipping (HDFS-3429)

• Problem: HDFS stores block data and block checksums in separate files

• A truly random read incurs two seeks instead of one! • Solution: HBase now stores its own checksums on its own internal 64KB blocks

• But it turns out that prior versions of HDFS still read the checksum, even if the client flipped verification off (see the example below)

• Fixing this yielded a 40% reduction in IOPS and latency for a multi-TB uniform random-read workload!
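An application that maintains its own checksums can ask the client to skip HDFS checksum verification, which (with HDFS-3429) also avoids the extra read of the checksum file. The path below is hypothetical and the application-level verification is left as a comment:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OwnChecksumReader {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Skip HDFS's per-block checksum verification; the application (like HBase
        // with its own per-64KB-block checksums) verifies integrity itself.
        fs.setVerifyChecksum(false);

        byte[] buf = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(new Path("/data/example.hfile"))) { // hypothetical path
          in.readFully(0, buf);   // positioned random read: one data seek, no checksum-file IOP
          // ... verify the application-level checksum embedded in buf here ...
        }
      }
    }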


Still more to come?

• Not a ton left on the read path • Write path still has some low-hanging fruit – hang tight for next year

• Reality check (multi-threaded random read) • Hadoop 1.0: 264MB/sec • Hadoop 2.x: 1393MB/sec • We’ve come a long way (5x) in a few years!



Other key new features

On-the-wire Encryption

• Strong encryption now supported for all traffic on the wire

• both data and RPC • Configurable cipher (e.g. RC4, DES, 3DES)

• Developed specifically based on requirements from the IC

• Reviewed by some experts here today (thanks!)


Rolling Upgrades and Wire Compatibility

• RPC and Data Transfer now use Protocol Buffers • Easy for developers to add new features without breaking compatibility

• Allows zero-downtime upgrade between minor releases

• Planning to lock down client-server compatibility even for more major releases in 2013



What’s up next in 2013?

HDFS Snapshots

• Full support for efficient subtree snapshots • Point-in-time “copy” of a part of the filesystem • Like a NetApp NAS: simple administrative API (see the sketch below) • Copy-on-write (instantaneous snapshotting) • Can serve as input for MR, distcp, backups, etc.

• Initially read-only, some thought about read-write in the future

• In progress now, hoping to merge into trunk by summertime
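Since the feature was still in progress when this deck was written, details may differ, but the simple administrative API is expected to look roughly like this from Java (the directory and snapshot names are made-up; an admin first marks the directory snapshottable, e.g. with hdfs dfsadmin -allowSnapshot):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SnapshotSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/example/warehouse");   // hypothetical snapshottable dir

        // Instantaneous, copy-on-write snapshot of the subtree.
        Path snap = fs.createSnapshot(dir, "before-nightly-load");
        System.out.println("Snapshot created at: " + snap);

        // The read-only snapshot is addressed under the ".snapshot" path component,
        // so it can serve as input for MR jobs, distcp, or backups.
        Path snapshotPath = new Path(dir, ".snapshot/before-nightly-load");
        System.out.println(fs.exists(snapshotPath));
      }
    }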


Hierarchical storage

• Early exploration into SSD/Flash • Anticipating “hybrid” storage will become common soon • What performance improvements do we need to take good advantage of it?

• Tiered caching of hot data onto flash? • Explicit storage “pools” for apps to manage?

• Big-RAM boxes • 256GB/box not so expensive anymore • How can we best make use of all this RAM? Caching!


Storage efficiency

• Transparent re-compression of cold data? • More efficient file formats

• Columnar storage for Hive, Impala • Faster to operate on and more compact

• Work on “fat datanodes” • 36-72TB/node will require some investment in DataNode scaling

• More parallelism, more efficient use of RAM, etc.
