
Hadoop Distributed File System Reliability and Durability at Facebook


Page 1: Hadoop Distributed File System Reliability and Durability at Facebook

I accidentally the Namenode
HDFS reliability at Facebook

Andrew Ryan, Facebook
April 2012

Page 2: Hadoop Distributed File System Reliability and Durability at Facebook

The HDFS Namenode: SPOF by design

▪  Single Point of Failure by design

▪  All metadata operations go through Namenode

▪  Early designers made tradeoffs: features & performance first

[Diagram: Simplified HDFS Architecture with the Namenode as SPOF, showing the Namenode, Secondary Namenode, Datanodes, and Clients]
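To make the dependency concrete, here is a minimal sketch (not from the deck) of a metadata-only operation against a Hadoop 0.20-era cluster; the hostname and path are hypothetical. A call like this is an RPC to the Namenode and never touches a Datanode, so it fails during a Namenode outage even though every byte of data is still intact.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListWarehouseDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.default.name (the 0.20-era key) points every client at the single Namenode.
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020"); // hypothetical host
    FileSystem fs = FileSystem.get(conf);
    // Listing a directory is pure metadata: it is served entirely by the Namenode.
    for (FileStatus status : fs.listStatus(new Path("/user/hive/warehouse"))) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
    fs.close();
  }
}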

Page 3: Hadoop Distributed File System Reliability and Durability at Facebook

HDFS major use cases at Facebook: Data Warehouse and Facebook Messages

                           Data Warehouse                    Facebook Messages
# of clusters              <10                               10’s
Size of clusters           Large (100’s – 1000’s of nodes)   Small (~100 nodes)
Processing workload        MapReduce batch jobs              HBase transactions
Namenode load              Very heavy                        Very light
End-user downtime impact   None                              Users without Messages

Page 4: Hadoop Distributed File System Reliability and Durability at Facebook

HDFS at Facebook: 2009-2012
Some things have changed…

                                       2009         2012
# HDFS clusters                        1            >100
Largest HDFS cluster size (storage)    600TB        >100PB
Largest HDFS cluster size (# files)    10 million   200 million
HDFS cluster types                     MapReduce    MapReduce, HBase, MySQL backups, +more

Page 5: Hadoop Distributed File System Reliability and Durability at Facebook

HDFS at Facebook: 2009-2012
…and some things have not

                                       2009                  2012
Single points of failure in HDFS       Namenode              Namenode
HDFS cluster restart time              60 minutes            60 minutes
Namenode failover method               Manual, complicated   Manual, complicated
SPOF Namenode as a cause of downtime   Unknown               Unknown

Page 6: Hadoop Distributed File System Reliability and Durability at Facebook

Data Warehouse

▪  Storage and querying of structured log data using Hive and Hadoop MapReduce

▪  Composed of dozens of tools/components

▪  A “vigorous and creative” user population

[Diagram: Data Warehouse stack on Hadoop: UI Tools, Workflow (Nocron), Query (Hive), Compute (MapReduce), Storage (HDFS)]
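As an illustration of the batch work that runs on the Compute (MapReduce) layer, here is a minimal Hadoop 0.20-style job that counts structured log records per event type. The log format (tab-separated lines whose first field is an event name) and the class names are hypothetical; in practice most Data Warehouse jobs are generated by Hive rather than written by hand.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventCount {
  // Emits (eventName, 1) for every log line.
  public static class EventMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      ctx.write(new Text(fields[0]), ONE);
    }
  }
  // Sums the counts for each event name.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text event, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(event, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "event-count");
    job.setJarByClass(EventCount.class);
    job.setMapperClass(EventMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a directory of daily logs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}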

Page 7: Hadoop Distributed File System Reliability and Durability at Facebook

Data Warehouse, all incidents: 41% are HDFS-related

Page 8: Hadoop Distributed File System Reliability and Durability at Facebook

Data Warehouse, SPOF Namenode incidents: 10% are attributable to the SPOF Namenode

Page 9: Hadoop Distributed File System Reliability and Durability at Facebook

Facebook Messages

[Diagram: Facebook Messages architecture, showing clients (www, chat, MTA, etc.), a User Directory Service, Messages Cells (Application Server, HBase/HDFS/ZK), Haystack, and mail servers handling inbound mail, anti-spam, and outbound mail]
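Since the Messages workload consists of HBase transactions layered on HDFS, here is a minimal sketch of a write using the HBase 0.90-era client API. The table name, row-key scheme, and column names are hypothetical and only illustrate the access pattern, not the actual Messages schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreMessage {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "messages");                 // hypothetical table
    Put put = new Put(Bytes.toBytes("user123#thread456"));       // hypothetical row key
    put.add(Bytes.toBytes("m"), Bytes.toBytes("body"),           // family "m", qualifier "body"
            Bytes.toBytes("Hi there"));
    table.put(put);  // HBase persists the edit via its write-ahead log, which lives on HDFS
    table.close();
  }
}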

Page 10: Hadoop Distributed File System Reliability and Durability at Facebook

Messages, all incidents: 16% are HDFS-related

Page 11: Hadoop Distributed File System Reliability and Durability at Facebook

Messages, SPOF Namenode incidents: 10% are attributable to the SPOF Namenode

Page 12: Hadoop Distributed File System Reliability and Durability at Facebook

What would happen if… Instead of this…

[Diagram: Simplified HDFS Architecture with the Namenode as SPOF, showing the Namenode, Secondary Namenode, Datanodes, and Clients]

Page 13: Hadoop Distributed File System Reliability and Durability at Facebook

What would happen if… We had this!

[Diagram: Simplified HDFS Architecture with a Highly Available Namenode, showing a Primary Namenode, a Standby Namenode, Datanodes, and Clients]

Page 14: Hadoop Distributed File System Reliability and Durability at Facebook

AvatarNode is our solution

[Diagrams: AvatarNode datanode view and AvatarNode client view]

Page 15: Hadoop Distributed File System Reliability and Durability at Facebook

AvatarNode is…

▪  A two-node, highly available Namenode with manual failover

▪  In production today at Facebook

▪  Open-sourced, based on Hadoop 0.20: https://github.com/facebook/hadoop-20
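Because failover is manual and coordinated through ZooKeeper, a client first has to learn which of the two nodes is currently the primary. The open-sourced code ships a client-side FileSystem wrapper that does this transparently; the sketch below only illustrates the underlying idea, and the ensemble address, znode path, and payload format are assumptions rather than AvatarNode's actual layout.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ResolvePrimaryNamenode {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble that records which Avatar is primary.
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) { /* no-op for a one-shot lookup */ }
    });
    // Hypothetical znode whose payload is the address of the current primary.
    byte[] data = zk.getData("/hdfs/cluster1/primary", false, new Stat());
    System.out.println("Active Namenode: " + new String(data, "UTF-8"));
    zk.close();
  }
}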

Page 16: Hadoop Distributed File System Reliability and Durability at Facebook

AvatarNode does not…

▪  Eliminate the dependency on shared storage for image/edits

▪  Provide instant failover (~1 second per million blocks+files; a worked estimate follows this list)

▪  Provide automated failover

▪  Guarantee I/O fencing for Primary/Standby (although precautions are taken)

▪  Require ZooKeeper at all times for normal operation (ZooKeeper is required for failover)

▪  Allow for >2 Namenodes to participate in an HA cluster

▪  Have any special network requirements
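To put the failover rate in perspective, here is a back-of-the-envelope estimate that combines the ~1 second per million blocks+files figure above with the 200-million-file cluster from the 2009-2012 table; the assumption of roughly one block per file is ours, for illustration only:

  failover time ≈ (files + blocks) / 1,000,000 seconds
               ≈ (200,000,000 + 200,000,000) / 1,000,000 seconds
               ≈ 400 seconds, i.e. roughly 6-7 minutes

That is still far quicker than the 60-minute full cluster restart shown earlier.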

Page 17: Hadoop Distributed File System Reliability and Durability at Facebook

Wrapping up…

▪  The SPOF Namenode is a weak link in HDFS’s design

▪  In our services which use HDFS, we estimate we could eliminate:

▪  10% of service downtime from unscheduled outages

▪  20-50% of downtime from scheduled maintenance

▪  AvatarNode is Facebook’s solution for 0.20, available today

▪  Other Namenode HA solutions are being worked on in HDFS trunk (HDFS-1623)

Page 18: Hadoop Distributed File System Reliability and Durability at Facebook

Questions?

Page 19: Hadoop Distributed File System Reliability and Durability at Facebook

Sessions will resume at 11:25am