BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automatic Service Failover by Mike Dalton of Zettaset

Fail-Proofing Hadoop Clusters with Automatic Service Failover (Financial Industry) - Michael Dalton, Ph.D., CTO and Co-founder, Zettaset Inc.

Page 1

Big Data Cloud Meetup

Big Data & Cloud Computing - Help, Educate & Demystify.

September 8th 2011

Page 2

Fail-Proofing Hadoop Clusters with Automated Service Failover

Michael Dalton, CTO, Zettaset

Sept 8th 2011 Meetup

Page 3

Problem

• Hadoop environments have many single points of failure (SPOFs)

• NameNode, JobTracker, Oozie

• Kerberos

Page 4

Ideal Solution

• Automated failover

• No data loss

• Handle all failover aspects (IP failover, etc)

• Failover all services

• No JobTracker = no MapReduce jobs

• No Kerberos = no new Kerberos authentication

Page 5

Existing Solutions

• AvatarNode (NameNode, patch from FB)

• Replicate writes to a backup service

• BackupNameNode (NN, not committed)

• 'Hot' copy of NameNode, replicated

• All failover manual

Page 6

Why is Failover Hard?

[Diagram: two masters (M1, M2) serving two clients (C1, C2)]

Page 7

Data Loss

• Split-Brain issues lose data

• Multiple masters = data corruption

• Clients confused about who is up

• Problem for traditional HA environments

• Linux-HA, etc

• Heartbeat failure != Death

Page 8

Theoretical Limits

• Can we solve this reliably?

• Fischer-Lynch-Paterson (FLP) Theorem

• Deterministic consensus is impossible in a fully asynchronous distributed system when even a single process can fail

• No free lunch

Page 9

Revisiting Our Assumptions

• Drop fully asynchronous requirement

• What about leases?

• Masters obtain, renew a lease

• Shut down if the lease expires (no longer fully asynchronous)

• Assumes only bounded relative clock skew

• Everyone should agree on how fast time elapses

Page 10

Master Failover

• Requires highly available lock / lease system

• Master obtains a lease to be master

• Replicates writes to a backup master

• If master loses lease, hold a new election

• Old master will shut down when lease expires

• If clock skew is bounded, no split-brain! (sketch below)
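
A minimal sketch of that lease discipline in Java. Everything here is illustrative rather than Zettaset's implementation: the lock-service call is a stub, and the key point is that expiry is checked against the local monotonic clock, so the old master fences itself even if it has lost contact with everyone else.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lease-holding master (illustration only): renew periodically,
// and fence ourselves if the lease expires by our own monotonic clock.
public class LeasedMaster {
    private static final long LEASE_MS = 10_000;  // lease length granted by the lock service
    private static final long RENEW_MS = 3_000;   // renew well before expiry

    private final AtomicLong lastRenewNanos = new AtomicLong(System.nanoTime());
    private final ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);

    public void start() {
        // Try to renew the lease with the (stubbed) lock/lease service.
        timer.scheduleAtFixedRate(() -> {
            if (tryRenewLease()) {
                lastRenewNanos.set(System.nanoTime());
            }
        }, 0, RENEW_MS, TimeUnit.MILLISECONDS);

        // Fencing check: if renewal hasn't succeeded within the lease length,
        // stop acting as master. With bounded clock-rate skew, a replacement
        // master is only elected after this shutdown has happened.
        timer.scheduleAtFixedRate(() -> {
            long sinceRenewMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - lastRenewNanos.get());
            if (sinceRenewMs > LEASE_MS) {
                shutdownMasterServices();
            }
        }, RENEW_MS, RENEW_MS, TimeUnit.MILLISECONDS);
    }

    private boolean tryRenewLease() { return true; }          // stub: call the lock service here
    private void shutdownMasterServices() { System.exit(1); } // stop serving before a new master takes over
}
```

In practice the renewal period is kept well below the lease length, so a single missed renewal does not force an unnecessary shutdown.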

Page 11

Failover: Locks/Consensus

• Apache ZooKeeper – Hadoop subproject

• Highly available, filesystem-like service for distributed consensus problems

• Create election, membership, etc. using special-purpose FS semantics

• 'Ephemeral' files disappear when session lease expires

• 'Sequential' files get an auto-incremented suffix (election sketch below)
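
As an illustration of how those two flags combine into an election, here is a bare-bones sketch using the ZooKeeper Java client. The /election path is made up, the parent znode is assumed to exist already, and reconnection, error handling, and watching the predecessor node are left out.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Bare-bones leader election with ephemeral + sequential znodes.
// Assumes the parent znode /election already exists.
public class ElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });

        // Ephemeral: the node vanishes if our session lease expires.
        // Sequential: ZK appends an increasing suffix, e.g. candidate-0000000042.
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);  // lowest sequence number wins the election

        boolean leader = me.endsWith(candidates.get(0));
        System.out.println(leader ? "Acting as master" : "Standing by as backup");
        // A real implementation watches the next-lowest candidate and re-checks
        // leadership when that znode disappears.
    }
}
```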

Page 12

ZooKeeper Internals

• ZooKeeper consists of a quorum of nodes (typically 3-9)

• Majority vote elects a leader (via leases)

• Leader proposes all FS modifications

• Majority must approve a modification for it to be committed

Page 13

Example: HBase

• Apache HBase has fully automated multi-master failover

• Prospective masters register in ZooKeeper

• ZooKeeper ephemeral/sequential files used for election

• Clients look up the current master address in ZooKeeper (lookup sketch below)

• Failover fully automated

• All files stored on HDFS, so no replication issues
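
A rough sketch of that lookup step with the plain ZooKeeper client. The znode path and host:port payload are simplifications for illustration; HBase uses its own znode layout and serialization for the master location. The idea is only that the master's address lives in ZK rather than in client configuration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch: a client discovering the current master through ZooKeeper instead of
// a fixed hostname.
public class MasterLookupSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });

        // watch=true: we are notified when the znode changes, i.e. when a
        // failover installs a new master address.
        byte[] data = zk.getData("/example/master-address", true, new Stat());
        String masterAddress = new String(data, StandardCharsets.UTF_8);

        System.out.println("Current master: " + masterAddress);
        // On the watch event, re-read the znode and reconnect to the new master.
    }
}
```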

Page 14

Failover: Replication

• HBase approach avoids replication issues with HDFS

• Kerberos, NN, Oozie, etc can't use HDFS

• Legacy compatibility (and, for the NameNode, circular dependencies)

• How can we add synchronous write replication?

• Can't break compatibility or change apps

Page 15

Failover: Networking

• HBase avoids networking failover by storing master address in ZK

• Legacy services use IP or hostnames, not ZK, to connect to master

• Out-of-trunk patches to make ZK a DNS server

• But Java caches DNS lookups and doesn't respect DNS TTLs by default, making the maximum failover time hard to bound (see the sketch below)

• DNS introduces its own issues anyway...
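
On the Java caching point above: positive lookups are cached according to the networkaddress.cache.ttl security property rather than the record's DNS TTL (for an implementation-specific period by default, and indefinitely when a security manager is installed), so DNS-based failover also requires bounding that cache explicitly. A minimal example, with illustrative values:

```java
import java.security.Security;

// Bounding the JVM's DNS cache so a changed record is noticed within a known time.
// These are standard security properties; the 5-second values are only examples.
public class DnsCacheConfig {
    public static void main(String[] args) {
        // Must be set before the first name lookup in the JVM; it can also be
        // configured in the JRE's java.security file.
        Security.setProperty("networkaddress.cache.ttl", "5");           // positive lookups: 5 s
        Security.setProperty("networkaddress.cache.negative.ttl", "5");  // failed lookups: 5 s
    }
}
```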

Page 17

IP Failover

• Instead, you can fail over IP addresses

• Virtual IPs – if supported by router

• Otherwise, dynamically update routes as part of your failover

• New leader updates routing tables.

• For local area networks, ensure ARP tables are updated

• Send gratuitous ARP, or store ARP information in ZK (sketch below)
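
For illustration, a bare-bones Java sketch of the network step a newly elected leader might run on Linux: claim the virtual IP, then announce it with gratuitous ARP so LAN neighbours refresh their caches. The interface name, address, and the iputils arping flags are assumptions, not a prescribed setup.

```java
// Claim the virtual IP and announce it with unsolicited (gratuitous) ARP.
// Interface, address, and arping flags are illustrative assumptions.
public class VipTakeoverSketch {
    public static void main(String[] args) throws Exception {
        String iface = "eth0";
        String vip = "10.0.0.50";

        // 1. Attach the virtual IP to this host.
        run("ip", "addr", "add", vip + "/24", "dev", iface);

        // 2. Gratuitous ARP: map the VIP to this host's MAC on the local
        //    network without waiting for neighbours' cache entries to expire.
        run("arping", "-U", "-c", "3", "-I", iface, vip);
    }

    private static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("command failed: " + String.join(" ", cmd));
        }
    }
}
```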

Page 18

Putting it all together

• Consensus/Election

• Use ZooKeeper, 3-9 node quorum

• State Replication

• Small data in ZK, large data in HDFS

• If neither possible, DRBD

• Network Failover

• Store master address in ZK

• Or, perform IP failover

• Dynamically update routing tables and the ARP cache

Page 19

Conclusion

• Fully automated failover is possible

• Design for synchronous replication

• Prevent split-brain

• Manage legacy compatibility

• Coming to Hadoop

• Zettaset provides fully HA Hadoop