What’s new and upcoming in HDFS
January 30, 2013
Todd Lipcon, Software Engineer ([email protected], @tlipcon)
Introductions
• Software engineer on Cloudera’s Storage Engineering team
• Committer and PMC Member for Apache Hadoop and Apache HBase
• Projects in 2012 • Responsible for >50% of the code for all phases of HA development
• Also worked on many performance and stability improvements
• This presentation is highly technical – please feel free to grab/email me later if you’d like to clarify anything!
Outline
• HDFS 2.0 – what’s new in 2012? • HA Phase 1 (Q1 2012) • HA Phase 2 (Q2-Q4 2012) • Performance improvements and other new features
• What’s coming in 2013? • HDFS Snapshots • Better storage density and file formats • Caching and Hierarchical Storage Management
HDFS HA Background
• HDFS’s strength is its simple and robust design • A single master NameNode maintains all metadata • Scales to multi-petabyte clusters easily on modern hardware
• Traditionally, the single master was also a single point of failure
• Generally good availability, but not ops-friendly • No hot patching, no hot reconfiguration • No hot hardware replacement
• Hadoop is now mission critical: a SPOF is not OK!
HDFS HA Development Phase 1
• Completed March 2012 (HDFS-1623) • Introduced the StandbyNode, a hot backup for the HDFS NameNode
• Relied on shared storage to synchronize namespace state • (e.g. a NAS filer appliance)
• Allowed operators to manually trigger failover to the Standby
• Sufficient for many HA use cases: avoided planned downtime for hardware and software upgrades, planned machine/OS maintenance, configuration changes, etc.
HDFS HA Architecture Phase 1
• Parallel block reports sent to the Active and Standby NameNodes
• NameNode state shared by locating the edit log on a NAS over NFS
• The Active NameNode writes while the Standby “tails” the log
• Client failover done via client configuration • Each client is configured with the addresses of both NNs and tries both to find the active (a configuration sketch follows below)
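For concreteness, here is a minimal sketch of that client-side HA configuration using the standard Hadoop 2 keys. The nameservice name “mycluster”, host names, and port are placeholders, and in practice these keys live in hdfs-site.xml rather than being set programmatically:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Logical name for the HA nameservice (placeholder)
    conf.set("dfs.nameservices", "mycluster");
    // The two NameNodes participating in HA
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
    // Proxy provider that tries each NN in turn until it finds the active one
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    // Clients then address the cluster by nameservice, e.g. hdfs://mycluster/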
Fencing and NFS
• Must avoid split-brain syndrome • Both nodes think they are active and try to write to the same edit log; the metadata becomes corrupt and requires manual intervention before restart
• Configure a fencing script • The script must ensure that the prior active has stopped writing • STONITH: shoot-the-other-node-in-the-head • Storage fencing: e.g. using the NetApp ONTAP API to restrict filer access
• The fencing script must succeed for a failover to succeed (a sample configuration follows below)
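As a hedged sketch, fencing is configured through the stock Hadoop fencing methods; sshfence and shell(...) ship with Hadoop, while the script path and key file below are placeholders:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Newline-separated list of fencing methods, tried in order; the shell
    // script (a placeholder path) might invoke STONITH or a filer API, and
    // must exit 0 for the failover to be considered fenced
    conf.set("dfs.ha.fencing.methods",
        "sshfence\nshell(/path/to/fence.sh)");
    // Key used by sshfence to log into the prior active and kill the NN process
    conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");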
Shortcomings of Phase 1
• Insufficient to protect against unplanned downtime • Manual failover only: requires an operator to step in quickly after a crash
• Various studies indicated this was the minority of downtime, but still important to address
• The requirement of a NAS device made deployment complex, expensive, and error-prone
(we always knew this was just the first phase!)
HDFS HA Development Phase 2
• Multiple new features for high availability • Automatic failover, based on Apache ZooKeeper • Remove the dependency on NAS (network-attached storage)
• Address new HA use cases • Avoid unplanned downtime due to software or hardware faults
• Deploy in filer-less environments • Completely stand-alone HA with no external hardware or software dependencies
• no Linux-HA, filers, etc.
Automatic Failover Goals
• Automatically detect failure of the Active NameNode • Hardware, software, network, etc.
• Do not require operator intervention to initiate failover
• Once failure is detected, the process completes automatically • Support manually initiated failover as first-class
• Operators can still trigger failover without having to stop the Active
• Do not introduce a new SPOF • All parts of an auto-failover deployment must themselves be HA
Automatic Failover Architecture
• Automatic failover requires ZooKeeper • Not required for manual failover
• ZK makes it easy to: • Detect failure of the Active NameNode • Determine which NameNode should become the Active NN
Automatic Failover Architecture
• New daemon: the ZooKeeper Failover Controller (ZKFC)
• In an auto-failover deployment, run two ZKFCs • One per NameNode, on that NameNode’s machine
• The ZKFC has three simple responsibilities: • Monitors the health of its associated NameNode • Participates in leader election among the NameNodes • Fences the other NameNode if it wins the election (a minimal election sketch follows below)
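Under the hood, the election is a standard ZooKeeper recipe: each ZKFC races to create an ephemeral lock znode, and the winner’s NameNode becomes active. A minimal sketch of the idea, not the actual ZKFC implementation (the quorum string, lock path, and payload are placeholders):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    void runElection() throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {});
        String lock = "/hadoop-ha/mycluster/lock";  // placeholder path
        try {
            // Ephemeral: the znode vanishes if this ZKFC's ZK session dies
            // (e.g. the machine crashes), which is how failure is detected
            zk.create(lock, "nn1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // Won the election: fence the other NN, then make ours active
        } catch (KeeperException.NodeExistsException e) {
            // Lost: watch the lock znode and re-run the election when it disappears
            zk.exists(lock, true);
        }
    }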
Shared Storage in HDFS HA
• The Standby NameNode synchronizes its namespace by following the Active NameNode’s transaction log
• Each operation (e.g. mkdir(/foo)) is written to the log by the Active
• The Standby Node periodically reads all new edits and applies them to its own metadata structures
• Reliable shared storage is required for correct operation
• In phase 1, shared storage was synonymous with an NFS-mounted NAS
Shortcomings of the NFS-based approach
• Custom hardware • Lots of our customers don’t have SAN/NAS available in their datacenters
• Costs money, time, and expertise • Extra “stuff” to monitor outside HDFS • We just moved the SPOF, we didn’t eliminate it!
• Complicated • Storage fencing, NFS mount options, multipath networking, etc. • Organizationally complicated: dependencies on the storage ops team
• NFS issues • Buggy client implementations, little control over timeout behavior, etc.
Primary Requirements for Improved Storage
• No special hardware (PDUs, NAS) • No custom fencing configuration
• Too complicated == too easy to misconfigure
• No SPOFs • punting to filers isn’t a good option • need something inherently distributed
Secondary Requirements
• Configurable degree of fault tolerance • Configure N nodes to tolerate (N-1)/2 failures (e.g. 3 nodes tolerate 1 failure; 5 tolerate 2)
• Making N bigger (within reasonable bounds) shouldn’t hurt performance. Implies:
• Writes done in parallel, not pipelined • Writes should not wait on the slowest replica
• Locate replicas on existing hardware investment (e.g. share with the JobTracker, NN, SBN)
Operational Requirements
• Should be operable by existing Hadoop admins. Implies:
• Same metrics system (“hadoop metrics”) • Same configuration system (XML) • Same logging infrastructure (log4j) • Same security system (Kerberos-based)
• Allow existing ops to easily deploy and manage the new feature
• Allow existing Hadoop tools to monitor the feature • (e.g. Cloudera Manager, Ganglia, etc.)
Our solution: QuorumJournalManager
• QuorumJournalManager (client) • Plugs into the JournalManager abstraction in the NN (instead of the existing FileJournalManager)
• Provides the edit log storage abstraction
• JournalNode (server) • Standalone daemon running on an odd number of nodes • Provides actual storage of edit logs on local disks • Could run inside other daemons in the future (a configuration sketch follows below)
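For illustration, pointing the NameNodes at a quorum of JournalNodes looks roughly like this (host names, the journal ID, and the local directory are placeholders; these keys normally live in hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Shared edits now live on a quorum of JournalNodes instead of an NFS filer;
    // the qjournal:// URI lists the JNs and the journal identifier
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
    // Local directory where each JournalNode keeps its copy of the edit log
    conf.set("dfs.journalnode.edits.dir", "/data/1/dfs/jn");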
Commit protocol
• The NameNode accumulates edits locally as they are logged
• On logSync(), it sends the accumulated batch to all JNs via Hadoop RPC
• Waits for a success ACK from a majority of nodes • Majority commit means that a single lagging or crashed replica does not impact NN latency
• Latency @ NN = median(Latency @ JNs)
• Uses the well-known Paxos algorithm to recover any in-flight edits on leader switchover (a sketch of the quorum wait follows below)
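A toy sketch of the majority-ACK idea (names like JournalChannel are invented for illustration; this is not the actual QJM code): the batch goes to every JN in parallel, and the writer unblocks as soon as a majority has acknowledged, so commit latency tracks the median JournalNode rather than the slowest one.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical stand-in for the per-JournalNode RPC proxy
    interface JournalChannel {
        void sendEdits(long firstTxId, byte[] batch) throws Exception;
    }

    void logSync(List<JournalChannel> journals, long firstTxId, byte[] batch)
            throws InterruptedException {
        int majority = journals.size() / 2 + 1;
        CountDownLatch acks = new CountDownLatch(majority);
        ExecutorService pool = Executors.newFixedThreadPool(journals.size());
        for (JournalChannel jn : journals) {
            pool.submit(() -> {
                try {
                    jn.sendEdits(firstTxId, batch);  // parallel, not pipelined
                    acks.countDown();                // count only successful ACKs
                } catch (Exception e) {
                    // A lagging or crashed JN simply never ACKs this batch
                }
            });
        }
        acks.await();   // resume once a majority has made the edits durable
        pool.shutdown();
    }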
JN Fencing
• How do we prevent split-brain? • Each instance of QJM is assigned a unique epoch number
• provides a strong ordering between client NNs • Each IPC contains the client’s epoch • The JN remembers on disk the highest epoch it has seen • Any request from an earlier epoch is rejected; any from a newer one is recorded on disk (see the sketch below)
• Distributed systems folks may recognize this technique from Paxos and other literature
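The JournalNode-side check is tiny. A hedged sketch (field and method names invented for illustration, not the actual JournalNode code):

    import java.io.IOException;

    class JournalEpochState {
        // The real JN persists this to disk so the promise survives restarts
        private long lastPromisedEpoch;

        // Called with the epoch carried on every NameNode IPC
        synchronized void checkEpoch(long requestEpoch) throws IOException {
            if (requestEpoch < lastPromisedEpoch) {
                // Request from a fenced-out (older) writer: reject it, so a
                // deposed NN can never assemble a quorum again
                throw new IOException("Epoch " + requestEpoch
                    + " < last promised epoch " + lastPromisedEpoch);
            }
            if (requestEpoch > lastPromisedEpoch) {
                // Record the new promise (durably, before ACKing, in the real JN)
                lastPromisedEpoch = requestEpoch;
            }
        }
    }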
Fencing with epochs
• Fencing is now implicit • The act of becoming active causes any earlier active NN to be fenced out
• Since a quorum of nodes has accepted the new active, any IPC carrying an earlier epoch number can no longer achieve quorum
• Eliminates confusing and error-prone custom fencing configuration
Other implementation features
• Hadoop Metrics • lag, percentile latencies, etc. from the perspective of the JN and NN • metrics for queued txns, % of time each JN fell behind, etc., to help suss out a slow JN before it causes problems
• Security • full Kerberos and SSL support: edits can optionally be encrypted in flight, and all access is mutually authenticated
Testing
• Randomized fault test • Runs all communications in a single thread with deterministic ordering and fault injections based on a seed
• Caught a number of really subtle bugs along the way • Run as an MR job: 5000 fault tests in parallel • Multiple CPU-years of stress testing: found 2 bugs in Jetty!
• Cluster testing: 100-node, MR, HBase, Hive, etc. • Commit latency in practice: within the same range as local disks (better than one of the two local disks, worse than the other)
Deployment
• Most customers run 3 JNs (tolerates 1 failure) • 1 on the NN, 1 on the SBN, 1 on the JobTracker/ResourceManager • Optionally run 2 more (e.g. on bastion/gateway nodes) to tolerate 2 failures
• No new hardware investment
• Refer to the docs for detailed configuration info
Status
• Merged into the Hadoop development trunk in early October
• Available in CDH4.1; will be in the upcoming Hadoop 2.1 • Deployed at several customer/community sites with good success so far (no lost data)
• In contrast, we’ve had several issues with misconfigured NFS filers causing downtime
• We highly recommend you use Quorum Journaling instead of NFS!
Summary of HA Improvements
• Run an active NameNode and a hot Standby NameNode
• Automatically triggers seamless failover using Apache ZooKeeper
• Stores shared metadata on the QuorumJournalManager: a fully distributed, redundant, low-latency journaling system
• All improvements available now in HDFS branch-2 and CDH4.1
Performance Improvements (overview)
• Several improvements made for Impala • Much faster libhdfs • APIs for spindle-based scheduling
• Other more general improvements (especially for HBase and Accumulo)
• Ability to read directly from block files in secure environments
• Ability for applications to perform their own checksums and eliminate IOPS
libhdfs “direct read” support (HDFS-2834)
• This can also benefit apps like HBase, Accumulo, and MR with a bit more work (TBD in 2013)
Disk locations API (HDFS-3672)
• HDFS has always exposed node locality information • Map<Block, List<Datanode Addresses>>
• Now it can also expose disk locality information • Map<Replica, List<Spindle Identifiers>>
• Impala uses this API to keep all disks spinning at full throughput (a usage sketch follows below)
• ~2x improvement on IO-bound workloads on 12-spindle machines
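A rough sketch of consuming the new API, assuming the interface added by HDFS-3672 (getFileBlockStorageLocations on DistributedFileSystem returning per-replica VolumeIds; exact names may vary by release):

    import java.util.Arrays;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.BlockStorageLocation;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.VolumeId;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    void printSpindles(DistributedFileSystem dfs, Path file, long len) throws Exception {
        // Classic node-level locality: block -> datanode addresses
        BlockLocation[] blocks = dfs.getFileBlockLocations(file, 0, len);
        // HDFS-3672: augment each location with per-replica volume (spindle) ids
        BlockStorageLocation[] locs =
            dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
        for (BlockStorageLocation loc : locs) {
            // Each VolumeId is an opaque identifier for the disk holding a
            // replica; a scheduler can use it to spread reads across spindles
            System.out.println(loc + " -> " + Arrays.toString(loc.getVolumeIds()));
        }
    }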
Short-circuit reads
• “Short circuit” allows HDFS clients to open HDFS block files directly from the local filesystem
• Avoids context switches and round trips between user space and kernel space memory, the TCP stack, etc.
• Uses 50% less CPU, avoids significant latency when reading data from the Linux buffer cache
• Sequential IO performance: 2x improvement • Random IO performance: 3.5x improvement
• This has existed for a while, in insecure setups only! • Clients need read access to all block files
Secure short-circuit reads (HDFS-347)
• The DataNode continues to arbitrate access to block files • It opens input streams and passes them to the DFS client after authentication and authorization checks
• Uses a trick involving Unix domain sockets (sendmsg with SCM_RIGHTS)
• Now perf-sensitive apps like HBase, Accumulo, and Impala can safely configure this feature in all environments (an example configuration follows below)
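For illustration, enabling the secure variant looks roughly like this (the socket path is a placeholder; the keys normally go in hdfs-site.xml on both the DataNode and the client):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Turn on short-circuit local reads
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Unix domain socket over which the DataNode passes open file descriptors
    // to the client (sendmsg with SCM_RIGHTS); the path is a placeholder
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");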
Checksum skipping (HDFS-3429)
• Problem: HDFS stores block data and block checksums in separate files
• A truly random read incurs two seeks instead of one! • Solution: HBase now stores its own checksums in its own internal 64KB blocks
• But it turns out that prior versions of HDFS still read the checksum, even if the client flipped verification off
• Fixing this yielded a 40% reduction in IOPS and latency for a multi-TB uniform random-read workload! (a client-side sketch follows below)
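On the client side, an application that maintains its own checksums opts out of HDFS verification with a standard FileSystem call. A minimal sketch (the file path is a placeholder); per the bullet above, prior to HDFS-3429 the checksum data was still read even with verification off:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    // The application (e.g. HBase) verifies its own checksums, so ask HDFS to
    // skip checksum reads entirely and save a seek per random read
    fs.setVerifyChecksum(false);
    try (FSDataInputStream in = fs.open(new Path("/hbase/data/somefile"))) {
        in.seek(12345);  // a random read is now a single seek into the block file
        int firstByte =;
    }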
Still more to come?
• Not a ton left on the read path • The write path still has some low-hanging fruit – hang tight for next year
• Reality check (multi-threaded random-read) • Hadoop 1.0: 264MB/sec • Hadoop 2.x: 1393MB/sec • We’ve come a long way (5x) in a few years!
On-the-wire Encryption
• Strong encryption now supported for all traffic on the wire • both data and RPC
• Configurable cipher (e.g. RC4, DES, 3DES)
• Developed specifically based on requirements from the IC
• Reviewed by some experts here today (thanks!)
Rolling Upgrades and Wire Compatibility
• RPC and Data Transfer now use Protocol Buffers • Easy for developers to add new features without breaking compatibility
• Allows zero-downtime upgrades between minor releases
• Planning to lock down client-server compatibility even across major releases in 2013
HDFS Snapshots
• Full support for efficient subtree snapshots • Point-in-time “copy” of a part of the filesystem • Like a NetApp NAS: simple administrative API • Copy-on-write (instantaneous snapshotting) • Can serve as input for MR, distcp, backups, etc.
• Initially read-only, with some thought about read-write in the future
• In progress now, hoping to merge into trunk by summertime (a sketch of the eventual user-facing API appears below)
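Snapshots were still in progress at the time of this talk; as a hedged sketch, the user-facing API that eventually shipped in Hadoop 2 looks like this (the directory and snapshot name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    Path dir = new Path("/user/todd/important-data");
    // An administrator first marks the subtree as snapshottable
    dfs.allowSnapshot(dir);
    // Point-in-time, copy-on-write snapshot: effectively instantaneous
    Path snap = dfs.createSnapshot(dir, "before-upgrade");
    // Readable at /user/todd/important-data/.snapshot/before-upgrade and
    // usable as input for MR jobs, distcp, backups, etc.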
Hierarchical storage
• Early exploration into SSD/flash • Anticipating “hybrid” storage will become common soon • What performance improvements do we need to take good advantage of it?
• Tiered caching of hot data onto flash? • Explicit storage “pools” for apps to manage?
• Big-RAM boxes • 256GB/box is not so expensive anymore • How can we best make use of all this RAM? Caching!
Storage efficiency
• Transparent re-compression of cold data? • More efficient file formats
• Columnar storage for Hive, Impala • Faster to operate on and more compact
• Work on “fat datanodes” • 36-72TB/node will require some investment in DataNode scaling
• More parallelism, more efficient use of RAM, etc.