Fail-Proofing Hadoop Clusters with Automatic Service Failover (Financial Industry) - Michael Dalton, Ph.D., CTO and Co-founder, Zettaset Inc.
Big Data Cloud Meetup
Big Data & Cloud Computing - Help, Educate & Demystify.
September 8th 2011
Fail-Proofing Hadoop Clusters with Automated Service Failover
Michael Dalton, CTO Zettaset
Problem
• Hadoop environments have many single points of failure (SPOFs)
  • NameNode, JobTracker, Oozie
  • Kerberos
Ideal Solution
• Automated failover
• No data loss
• Handle all aspects of failover (IP failover, etc.)
• Fail over all services
  • No JobTracker = no MapReduce jobs
  • No Kerberos = no new Kerberos authentication
Existing Solutions
• AvatarNode (NameNode; patch from Facebook)
  • Replicates writes to a backup service
• BackupNameNode (NameNode; not committed to trunk)
  • 'Hot' replicated copy of the NameNode
• In both cases, all failover is manual
Why is Failover Hard?
[Diagram: two masters (M1, M2) and two clients (C1, C2), each client unsure which master is live]
Data Loss
• Split-brain scenarios lose data
  • Multiple simultaneous masters = data corruption
  • Clients are confused about which master is live
• A problem for traditional HA environments too
  • Linux-HA, etc.
  • Heartbeat failure != death: a master that has only lost its heartbeat link may still be serving writes
Theoretical Limits
• Can we solve this reliably?
• Fischer-Lynch-Paterson (FLP) impossibility result
  • Deterministic consensus is impossible in a fully asynchronous distributed system if even a single process can fail
• No free lunch
Revisiting Our Assumptions
• Drop the fully-asynchronous assumption
• What about leases?
  • Masters obtain, then periodically renew, a lease
  • Shut down if the lease expires (a timing assumption, so no longer fully asynchronous)
  • Assumes only bounded relative clock skew
  • Everyone must roughly agree on how fast time elapses
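A minimal sketch of that self-fencing lease loop in Java (the names, timeouts, and tryRenew() stub are all illustrative stand-ins for a real lock-service round trip): the master tracks its lease on its own clock and exits the moment the lease lapses, so a backup that waits out the lease plus the maximum skew can never coexist with it.

    // Self-fencing lease loop (illustrative sketch, not a real API).
    public final class LeasedMaster {
        private static final long LEASE_MS = 10_000;          // lease length
        private static final long RENEW_INTERVAL_MS = 2_000;  // renew cadence

        public static void main(String[] args) throws InterruptedException {
            // Expiry is tracked on the local monotonic clock; only bounded
            // relative skew against other nodes' clocks is assumed.
            long expiryNanos = System.nanoTime() + LEASE_MS * 1_000_000;
            while (true) {
                if (System.nanoTime() >= expiryNanos) {
                    // Lease lapsed: self-fence before a new master can exist.
                    System.err.println("Lease expired, shutting down");
                    System.exit(1);
                }
                if (tryRenew()) {
                    expiryNanos = System.nanoTime() + LEASE_MS * 1_000_000;
                }
                Thread.sleep(RENEW_INTERVAL_MS);
            }
        }

        // Placeholder for a renewal round trip to the lock service.
        private static boolean tryRenew() { return true; }
    }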
Master Failover
• Requires a highly available lock/lease service
• The master obtains a lease in order to act as master
  • It replicates writes to a backup master
• If the master loses its lease, hold a new election
  • The old master shuts itself down when its lease expires
• If clock skew is bounded, no split-brain!
Failover: Locks/Consensus
• Apache ZooKeeper – Hadoop subproject
• A highly available coordination service exposing a small filesystem-like namespace for distributed consensus problems
• Elections, group membership, etc. are built from its special-purpose file semantics
  • 'Ephemeral' files disappear when the session lease expires
  • 'Sequential' files get an auto-incremented suffix
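Those two file types are exactly what the standard ZooKeeper leader-election recipe is built from. A sketch in Java (the /election path and session timeout are illustrative, and the parent znode is assumed to already exist):

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.*;
    import org.apache.zookeeper.ZooDefs.Ids;

    public class LeaderElection implements Watcher {
        private final ZooKeeper zk;
        private String myNode;  // e.g. "/election/n_0000000003"

        public LeaderElection(String connectString) throws Exception {
            // The session timeout acts as the lease: if this process stalls
            // past it, ZooKeeper deletes its ephemeral node automatically.
            zk = new ZooKeeper(connectString, 10000, this);
        }

        public void volunteer() throws Exception {
            myNode = zk.create("/election/n_", new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            checkLeadership();
        }

        private void checkLeadership() throws Exception {
            List<String> children = zk.getChildren("/election", false);
            Collections.sort(children);  // zero-padded suffix: lowest wins
            int me = children.indexOf(myNode.substring("/election/".length()));
            if (me == 0) {
                becomeMaster();  // we hold the lowest sequence number
            } else {
                // Watch only the candidate just ahead of us, so one death
                // wakes one watcher instead of the whole herd.
                zk.exists("/election/" + children.get(me - 1), this);
            }
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDeleted) {
                try { checkLeadership(); } catch (Exception e) { /* retry */ }
            }
        }

        private void becomeMaster() { /* start serving; publish address */ }
    }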
ZooKeeper Internals
• ZooKeeper consists of a quorum of nodes (typically 3-9)
• Majority vote elects a leader (via leases)
• Leader proposes all FS modifications
• Majority must approve a modification for it to be committed
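For reference, a minimal three-node quorum is configured roughly like this (hostnames and paths are placeholders); each member lists every server, with one port for followers to sync with the leader and one for leader election:

    # zoo.cfg (illustrative three-node quorum)
    tickTime=2000
    initLimit=5
    syncLimit=2
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # server.N=host:peerPort:leaderElectionPort
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888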
Example: HBase
• Apache HBase has fully automated multi-master failover
  • Prospective masters register themselves in ZooKeeper
  • ZooKeeper ephemeral/sequential files are used for the election
• Clients look up the current master's address in ZooKeeper
• Failover is fully automated
• All files are stored on HDFS, so there are no replication issues
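In the same spirit, a client can resolve the live master with a single ZooKeeper read. A hedged sketch (the znode path and host:port payload below are illustrative, not HBase's actual layout, which varies by version):

    import org.apache.zookeeper.ZooKeeper;

    public class MasterLookup {
        // Returns the "host:port" the elected master published about itself.
        public static String currentMaster(ZooKeeper zk) throws Exception {
            byte[] data = zk.getData("/hbase/master", false, null);
            return new String(data, "UTF-8");
        }
    }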
Failover: Replication
• The HBase approach avoids replication issues by keeping all state in HDFS
• Kerberos, the NameNode, Oozie, etc. can't use HDFS
  • Legacy compatibility (and, for the NameNode, a circular dependency)
• How can we add synchronous write replication?
  • Without breaking compatibility or changing the applications
Failover: Networking
• HBase avoids networking failover by storing the master's address in ZK
• Legacy services use IPs or hostnames, not ZK, to connect to the master
• Out-of-trunk patches exist to make ZK act as a DNS server
  • But Java doesn't respect DNS TTLs anyway, which makes the maximum failover time hard to bound
• DNS introduces its own issues anyway...
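To make the Java point concrete: the JVM's resolver cache is controlled by security properties rather than DNS TTLs, and must be capped before the first lookup for failover time to be bounded (the values below are illustrative):

    import java.security.Security;

    public class DnsCacheConfig {
        public static void main(String[] args) {
            // Cache successful lookups for at most 30 seconds
            // (the default can be "forever" under a security manager).
            Security.setProperty("networkaddress.cache.ttl", "30");
            // Don't cache failed lookups at all.
            Security.setProperty("networkaddress.cache.negative.ttl", "0");
        }
    }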
IP Failover
• Instead, you can fail over IP addresses
• Virtual IPs – if supported by the router
• Otherwise, dynamically update routes as part of your failover
  • The new leader updates the routing tables
• For local area networks, ensure ARP caches are updated
  • Gratuitous ARP, or store ARP information in ZK
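A sketch of the takeover step the new leader might run, assuming Linux hosts with the iproute2 and arping utilities installed (the VIP, prefix length, and device name are placeholders):

    import java.io.IOException;

    public class IpFailover {
        public static void takeOver(String vip, String dev) throws Exception {
            // Bind the virtual IP to this host's interface.
            run("ip", "addr", "add", vip + "/24", "dev", dev);
            // Gratuitous (unsolicited) ARP: announce that the VIP now maps
            // to our MAC so LAN peers refresh their ARP caches immediately.
            run("arping", "-U", "-c", "3", "-I", dev, vip);
        }

        private static void run(String... cmd) throws Exception {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0)
                throw new IOException("command failed: " + String.join(" ", cmd));
        }
    }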
Putting it all together
• Consensus/Election
  • Use ZooKeeper with a 3-9 node quorum
• State Replication
  • Small data in ZK, large data in HDFS
  • If neither is possible, DRBD (see the config sketch below)
• Network Failover
  • Store the master's address in ZK
  • Or perform IP failover
    • Dynamically update routing tables and ARP caches
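For the DRBD option, synchronous block-level replication comes down to choosing protocol C in the resource definition; a minimal sketch, with hostnames, disks, and addresses as placeholders:

    # /etc/drbd.conf (illustrative)
    resource master-state {
      protocol C;              # C = fully synchronous replication
      on master1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on master2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }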
Conclusion
• Fully automated failover is possible
  • Design for synchronous replication
  • Prevent split-brain
  • Manage legacy compatibility
• Coming to Hadoop
• Zettaset provides a fully HA Hadoop distribution