What’s new and upcoming in HDFS
January 30, 2013
Todd Lipcon, Software Engineer ([email protected], @tlipcon)
Introductions
• Software engineer on Cloudera’s Storage Engineering team
• Committer and PMC Member for Apache Hadoop and Apache HBase
• Projects in 2012 • Responsible for >50% of the code for all phases of HA development
• Also worked on many performance and stability improvements
• This presentation is highly technical – please feel free to grab/email me later if you’d like to clarify anything!
Outline
• HDFS 2.0 – what’s new in 2012? • HA Phase 1 (Q1 2012) • HA Phase 2 (Q2-Q4 2012) • Performance improvements and other new features
• What’s coming in 2013? • HDFS Snapshots • Better storage density and file formats • Caching and Hierarchical Storage Management
HDFS HA Background
• HDFS’s strength is its simple and robust design • A single master NameNode maintains all metadata • Scales to multi-petabyte clusters easily on modern hardware
• Traditionally, the single master was also a single point of failure
• Generally good availability, but not ops-friendly • No hot patching, no hot reconfiguration • No hot hardware replacement
• Hadoop is now mission critical: a SPOF is not OK!
HDFS HA Development Phase 1
• Completed March 2012 (HDFS-1623) • Introduced the StandbyNode, a hot backup for the HDFS NameNode
• Relied on shared storage to synchronize namespace state • (e.g. a NAS filer appliance)
• Allowed operators to manually trigger failover to the Standby
• Sufficient for many HA use cases: avoided planned downtime for hardware and software upgrades, planned machine/OS maintenance, configuration changes, etc.
HDFS HA Architecture Phase 1
• Parallel block reports sent to the Active and Standby NameNodes
• NameNode state shared by locating the edit log on a NAS over NFS
• The Active NameNode writes while the Standby “tails” the log
• Client failover done via client configuration • Each client is configured with the addresses of both NNs and tries both to find the active (a configuration sketch follows below)
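For concreteness, here is a minimal sketch of that client-side HA configuration using the standard Hadoop 2 keys. The nameservice name “mycluster”, host names, and port are placeholders, and in practice these keys live in hdfs-site.xml rather than being set programmatically:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Logical name for the HA nameservice (placeholder)
    conf.set("dfs.nameservices", "mycluster");
    // The two NameNodes participating in HA
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
    // Proxy provider that tries each NN in turn until it finds the active one
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    // Clients then address the cluster by nameservice, e.g. hdfs://mycluster/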
Fencing and NFS
• Must avoid split-brain syndrome • Both nodes think they are active and try to write to the same edit log; the metadata becomes corrupt and requires manual intervention before restart
• Configure a fencing script • The script must ensure that the prior active has stopped writing • STONITH: shoot-the-other-node-in-the-head • Storage fencing: e.g. using the NetApp ONTAP API to restrict filer access
• The fencing script must succeed for a failover to succeed (a sample configuration follows below)
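As a hedged sketch, fencing is configured through the stock Hadoop fencing methods; sshfence and shell(...) ship with Hadoop, while the script path and key file below are placeholders:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Newline-separated list of fencing methods, tried in order; the shell
    // script (a placeholder path) might invoke STONITH or a filer API, and
    // must exit 0 for the failover to be considered fenced
    conf.set("dfs.ha.fencing.methods",
        "sshfence\nshell(/path/to/fence.sh)");
    // Key used by sshfence to log into the prior active and kill the NN process
    conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");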
Shortcomings of Phase 1
• Insufficient to protect against unplanned downtime • Manual failover only: requires an operator to step in quickly after a crash
• Various studies indicated this was the minority of downtime, but still important to address
• The requirement of a NAS device made deployment complex, expensive, and error-prone
(we always knew this was just the first phase!)
HDFS HA Development Phase 2
• Multiple new features for high availability • Automatic failover, based on Apache ZooKeeper • Remove the dependency on NAS (network-attached storage)
• Address new HA use cases • Avoid unplanned downtime due to software or hardware faults
• Deploy in filer-less environments • Completely stand-alone HA with no external hardware or software dependencies
• no Linux-HA, filers, etc.
Automatic Failover Goals
• Automatically detect failure of the Active NameNode • Hardware, software, network, etc.
• Do not require operator intervention to initiate failover
• Once failure is detected, the process completes automatically • Support manually initiated failover as first-class
• Operators can still trigger failover without having to stop the Active
• Do not introduce a new SPOF • All parts of an auto-failover deployment must themselves be HA
Automatic Failover Architecture
• Automatic failover requires ZooKeeper • Not required for manual failover
• ZK makes it easy to: • Detect failure of the Active NameNode • Determine which NameNode should become the Active NN
Automatic Failover Architecture
• New daemon: the ZooKeeper Failover Controller (ZKFC)
• In an auto-failover deployment, run two ZKFCs • One per NameNode, on that NameNode’s machine
• The ZKFC has three simple responsibilities: • Monitors the health of its associated NameNode • Participates in leader election among the NameNodes • Fences the other NameNode if it wins the election (a minimal election sketch follows below)
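Under the hood, the election is a standard ZooKeeper recipe: each ZKFC races to create an ephemeral lock znode, and the winner’s NameNode becomes active. A minimal sketch of the idea, not the actual ZKFC implementation (the quorum string, lock path, and payload are placeholders):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    void runElection() throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {});
        String lock = "/hadoop-ha/mycluster/lock";  // placeholder path
        try {
            // Ephemeral: the znode vanishes if this ZKFC's ZK session dies
            // (e.g. the machine crashes), which is how failure is detected
            zk.create(lock, "nn1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // Won the election: fence the other NN, then make ours active
        } catch (KeeperException.NodeExistsException e) {
            // Lost: watch the lock znode and re-run the election when it disappears
            zk.exists(lock, true);
        }
    }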
Shared Storage in HDFS HA
• The Standby NameNode synchronizes its namespace by following the Active NameNode’s transaction log
• Each operation (e.g. mkdir(/foo)) is written to the log by the Active
• The Standby Node periodically reads all new edits and applies them to its own metadata structures
• Reliable shared storage is required for correct operation
• In phase 1, shared storage was synonymous with an NFS-mounted NAS
Shortcomings of the NFS-based approach
• Custom hardware • Lots of our customers don’t have SAN/NAS available in their datacenters
• Costs money, time, and expertise • Extra “stuff” to monitor outside HDFS • We just moved the SPOF, we didn’t eliminate it!
• Complicated • Storage fencing, NFS mount options, multipath networking, etc. • Organizationally complicated: dependencies on the storage ops team
• NFS issues • Buggy client implementations, little control over timeout behavior, etc.
Primary Requirements for Improved Storage
• No special hardware (PDUs, NAS) • No custom fencing configuration
• Too complicated == too easy to misconfigure
• No SPOFs • punting to filers isn’t a good option • need something inherently distributed
Secondary Requirements
• Configurable degree of fault tolerance • Configure N nodes to tolerate (N-1)/2 failures (e.g. 3 nodes tolerate 1 failure; 5 tolerate 2)
• Making N bigger (within reasonable bounds) shouldn’t hurt performance. Implies:
• Writes done in parallel, not pipelined • Writes should not wait on the slowest replica
• Locate replicas on existing hardware investment (e.g. share with the JobTracker, NN, SBN)
Operational Requirements
• Should be operable by existing Hadoop admins. Implies:
• Same metrics system (“hadoop metrics”) • Same configuration system (XML) • Same logging infrastructure (log4j) • Same security system (Kerberos-based)
• Allow existing ops to easily deploy and manage the new feature
• Allow existing Hadoop tools to monitor the feature • (e.g. Cloudera Manager, Ganglia, etc.)
Our solution: QuorumJournalManager
• QuorumJournalManager (client) • Plugs into the JournalManager abstraction in the NN (instead of the existing FileJournalManager)
• Provides the edit log storage abstraction
• JournalNode (server) • Standalone daemon running on an odd number of nodes • Provides actual storage of edit logs on local disks • Could run inside other daemons in the future (a configuration sketch follows below)
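For illustration, pointing the NameNodes at a quorum of JournalNodes looks roughly like this (host names, the journal ID, and the local directory are placeholders; these keys normally live in hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Shared edits now live on a quorum of JournalNodes instead of an NFS filer;
    // the qjournal:// URI lists the JNs and the journal identifier
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
    // Local directory where each JournalNode keeps its copy of the edit log
    conf.set("dfs.journalnode.edits.dir", "/data/1/dfs/jn");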
Commit protocol
• The NameNode accumulates edits locally as they are logged
• On logSync(), it sends the accumulated batch to all JNs via Hadoop RPC
• Waits for a success ACK from a majority of nodes • Majority commit means that a single lagging or crashed replica does not impact NN latency
• Latency @ NN = median(Latency @ JNs)
• Uses the well-known Paxos algorithm to recover any in-flight edits on leader switchover (a sketch of the quorum wait follows below)
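A toy sketch of the majority-ACK idea (names like JournalChannel are invented for illustration; this is not the actual QJM code): the batch goes to every JN in parallel, and the writer unblocks as soon as a majority has acknowledged, so commit latency tracks the median JournalNode rather than the slowest one.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical stand-in for the per-JournalNode RPC proxy
    interface JournalChannel {
        void sendEdits(long firstTxId, byte[] batch) throws Exception;
    }

    void logSync(List<JournalChannel> journals, long firstTxId, byte[] batch)
            throws InterruptedException {
        int majority = journals.size() / 2 + 1;
        CountDownLatch acks = new CountDownLatch(majority);
        ExecutorService pool = Executors.newFixedThreadPool(journals.size());
        for (JournalChannel jn : journals) {
            pool.submit(() -> {
                try {
                    jn.sendEdits(firstTxId, batch);  // parallel, not pipelined
                    acks.countDown();                // count only successful ACKs
                } catch (Exception e) {
                    // A lagging or crashed JN simply never ACKs this batch
                }
            });
        }
        acks.await();   // resume once a majority has made the edits durable
        pool.shutdown();
    }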
JN Fencing
• How do we prevent split-brain? • Each instance of QJM is assigned a unique epoch number
• provides a strong ordering between client NNs • Each IPC contains the client’s epoch • The JN remembers on disk the highest epoch it has seen • Any request from an earlier epoch is rejected; any from a newer one is recorded on disk (see the sketch below)
• Distributed systems folks may recognize this technique from Paxos and other literature
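The JournalNode-side check is tiny. A hedged sketch (field and method names invented for illustration, not the actual JournalNode code):

    import java.io.IOException;

    class JournalEpochState {
        // The real JN persists this to disk so the promise survives restarts
        private long lastPromisedEpoch;

        // Called with the epoch carried on every NameNode IPC
        synchronized void checkEpoch(long requestEpoch) throws IOException {
            if (requestEpoch < lastPromisedEpoch) {
                // Request from a fenced-out (older) writer: reject it, so a
                // deposed NN can never assemble a quorum again
                throw new IOException("Epoch " + requestEpoch
                    + " < last promised epoch " + lastPromisedEpoch);
            }
            if (requestEpoch > lastPromisedEpoch) {
                // Record the new promise (durably, before ACKing, in the real JN)
                lastPromisedEpoch = requestEpoch;
            }
        }
    }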
Fencing with epochs
• Fencing is now implicit • The act of becoming active causes any earlier active NN to be fenced out
• Since a quorum of nodes has accepted the new active, any IPC carrying an earlier epoch number can no longer achieve quorum
• Eliminates confusing and error-prone custom fencing configuration
Other implementation features
• Hadoop Metrics • lag, percentile latencies, etc. from the perspective of the JN and NN • metrics for queued txns, % of time each JN fell behind, etc., to help suss out a slow JN before it causes problems
• Security • full Kerberos and SSL support: edits can optionally be encrypted in flight, and all access is mutually authenticated
Testing
• Randomized fault test • Runs all communications in a single thread with deterministic ordering and fault injections based on a seed
• Caught a number of really subtle bugs along the way • Run as an MR job: 5000 fault tests in parallel • Multiple CPU-years of stress testing: found 2 bugs in Jetty!
• Cluster testing: 100-node, MR, HBase, Hive, etc. • Commit latency in practice: within the same range as local disks (better than one of the two local disks, worse than the other)
Deployment
• Most customers run 3 JNs (tolerates 1 failure) • 1 on the NN, 1 on the SBN, 1 on the JobTracker/ResourceManager • Optionally run 2 more (e.g. on bastion/gateway nodes) to tolerate 2 failures
• No new hardware investment
• Refer to the docs for detailed configuration info
Status
• Merged into the Hadoop development trunk in early October
• Available in CDH4.1; will be in the upcoming Hadoop 2.1 • Deployed at several customer/community sites with good success so far (no lost data)
• In contrast, we’ve had several issues with misconfigured NFS filers causing downtime
• We highly recommend you use Quorum Journaling instead of NFS!
Summary of HA Improvements
• Run an active NameNode and a hot Standby NameNode
• Automatically triggers seamless failover using Apache ZooKeeper
• Stores shared metadata on the QuorumJournalManager: a fully distributed, redundant, low-latency journaling system
• All improvements available now in HDFS branch-2 and CDH4.1
Performance Improvements (overview)
• Several improvements made for Impala • Much faster libhdfs • APIs for spindle-based scheduling
• Other more general improvements (especially for HBase and Accumulo)
• Ability to read directly from block files in secure environments
• Ability for applications to perform their own checksums and eliminate IOPS
libhdfs “direct read” support (HDFS-2834)
• This can also benefit apps like HBase, Accumulo, and MR with a bit more work (TBD in 2013)
Disk locations API (HDFS-3672)
• HDFS has always exposed node locality information • Map<Block, List<Datanode Addresses>>
• Now it can also expose disk locality information • Map<Replica, List<Spindle Identifiers>>
• Impala uses this API to keep all disks spinning at full throughput (a usage sketch follows below)
• ~2x improvement on IO-bound workloads on 12-spindle machines
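A rough sketch of consuming the new API, assuming the interface added by HDFS-3672 (getFileBlockStorageLocations on DistributedFileSystem returning per-replica VolumeIds; exact names may vary by release):

    import java.util.Arrays;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.BlockStorageLocation;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.VolumeId;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    void printSpindles(DistributedFileSystem dfs, Path file, long len) throws Exception {
        // Classic node-level locality: block -> datanode addresses
        BlockLocation[] blocks = dfs.getFileBlockLocations(file, 0, len);
        // HDFS-3672: augment each location with per-replica volume (spindle) ids
        BlockStorageLocation[] locs =
            dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
        for (BlockStorageLocation loc : locs) {
            // Each VolumeId is an opaque identifier for the disk holding a
            // replica; a scheduler can use it to spread reads across spindles
            System.out.println(loc + " -> " + Arrays.toString(loc.getVolumeIds()));
        }
    }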
Short-circuit reads
• “Short circuit” allows HDFS clients to open HDFS block files directly from the local filesystem
• Avoids context switches and round trips between user space and kernel space memory, the TCP stack, etc.
• Uses 50% less CPU, avoids significant latency when reading data from the Linux buffer cache
• Sequential IO performance: 2x improvement • Random IO performance: 3.5x improvement
• This has existed for a while, in insecure setups only! • Clients need read access to all block files
Secure short-circuit reads (HDFS-347)
• The DataNode continues to arbitrate access to block files • It opens input streams and passes them to the DFS client after authentication and authorization checks
• Uses a trick involving Unix domain sockets (sendmsg with SCM_RIGHTS)
• Now perf-sensitive apps like HBase, Accumulo, and Impala can safely configure this feature in all environments (an example configuration follows below)
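For illustration, enabling the secure variant looks roughly like this (the socket path is a placeholder; the keys normally go in hdfs-site.xml on both the DataNode and the client):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Turn on short-circuit local reads
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Unix domain socket over which the DataNode passes open file descriptors
    // to the client (sendmsg with SCM_RIGHTS); the path is a placeholder
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");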
Checksum skipping (HDFS-3429)
• Problem: HDFS stores block data and block checksums in separate files
• A truly random read incurs two seeks instead of one! • Solution: HBase now stores its own checksums in its own internal 64KB blocks
• But it turns out that prior versions of HDFS still read the checksum, even if the client flipped verification off
• Fixing this yielded a 40% reduction in IOPS and latency for a multi-TB uniform random-read workload! (a client-side sketch follows below)
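On the client side, an application that maintains its own checksums opts out of HDFS verification with a standard FileSystem call. A minimal sketch (the file path is a placeholder); per the bullet above, prior to HDFS-3429 the checksum data was still read even with verification off:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    // The application (e.g. HBase) verifies its own checksums, so ask HDFS to
    // skip checksum reads entirely and save a seek per random read
    fs.setVerifyChecksum(false);
    try (FSDataInputStream in = fs.open(new Path("/hbase/data/somefile"))) {
        in.seek(12345);  // a random read is now a single seek into the block file
        int firstByte =;
    }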
Still more to come?
• Not a ton left on the read path • The write path still has some low-hanging fruit – hang tight for next year
• Reality check (multi-threaded random-read) • Hadoop 1.0: 264MB/sec • Hadoop 2.x: 1393MB/sec • We’ve come a long way (5x) in a few years!
On-the-wire Encryption
• Strong encryption now supported for all traffic on the wire • both data and RPC
• Configurable cipher (e.g. RC4, DES, 3DES)
• Developed specifically based on requirements from the IC
• Reviewed by some experts here today (thanks!)
Rolling Upgrades and Wire Compatibility
• RPC and Data Transfer now use Protocol Buffers • Easy for developers to add new features without breaking compatibility
• Allows zero-downtime upgrades between minor releases
• Planning to lock down client-server compatibility even across major releases in 2013
HDFS Snapshots
• Full support for efficient subtree snapshots • Point-in-time “copy” of a part of the filesystem • Like a NetApp NAS: simple administrative API • Copy-on-write (instantaneous snapshotting) • Can serve as input for MR, distcp, backups, etc.
• Initially read-only, with some thought about read-write in the future
• In progress now, hoping to merge into trunk by summertime (a sketch of the eventual user-facing API appears below)
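Snapshots were still in progress at the time of this talk; as a hedged sketch, the user-facing API that eventually shipped in Hadoop 2 looks like this (the directory and snapshot name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    Path dir = new Path("/user/todd/important-data");
    // An administrator first marks the subtree as snapshottable
    dfs.allowSnapshot(dir);
    // Point-in-time, copy-on-write snapshot: effectively instantaneous
    Path snap = dfs.createSnapshot(dir, "before-upgrade");
    // Readable at /user/todd/important-data/.snapshot/before-upgrade and
    // usable as input for MR jobs, distcp, backups, etc.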
Hierarchical storage
• Early exploration into SSD/flash • Anticipating “hybrid” storage will become common soon • What performance improvements do we need to take good advantage of it?
• Tiered caching of hot data onto flash? • Explicit storage “pools” for apps to manage?
• Big-RAM boxes • 256GB/box is not so expensive anymore • How can we best make use of all this RAM? Caching!
Storage efficiency
• Transparent re-compression of cold data? • More efficient file formats
• Columnar storage for Hive, Impala • Faster to operate on and more compact
• Work on “fat datanodes” • 36-72TB/node will require some investment in DataNode scaling
• More parallelism, more efficient use of RAM, etc.