BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automatic Service Failover by Mike Dalton of Zettaset

Fail-Proofing Hadoop Clusters with Automatic Service Failover (Financial Industry) - Michael Dalton, Ph.D., CTO and Co-founder, Zettaset Inc.

Page 1

Big Data Cloud Meetup

Big Data & Cloud Computing - Help, Educate & Demystify.

September 8th 2011

Page 2

Fail-Proofing Hadoop Clusters with Automated Service Failover

Michael Dalton, CTO, Zettaset

Sept 8th 2011 Meetup

Page 3

Problem

• Hadoop environments have many single points of failure (SPOFs)

• NameNode, JobTracker, Oozie

• Kerberos

Page 4

Ideal Solution

• Automated failover

• No data loss

• Handle all failover aspects (IP failover, etc)

• Failover all services

• No JobTracker = no MapReduce jobs

• No Kerberos = no new Kerberos authentication

Page 5

Existing Solutions

• AvatarNode (NameNode, patch from FB)

• Replicate writes to a backup service

• BackupNameNode (NN, not committed)

• 'Hot' copy of NameNode, replicated

• All failover manual

Page 6

Why is Failover Hard?

[Diagram: two masters (M1, M2) serving two clients (C1, C2)]

Page 7

Data Loss

• Split-Brain issues lose data

• Multiple masters = data corruption

• Clients confused about who is up

• Problem for traditional HA environments

• Linux-HA, etc

• Heartbeat failure != Death

Page 8

Theoretical Limits

• Can we solve this reliably?

• Fischer-Lynch-Paterson (FLP) Theorem

• Deterministic consensus is impossible in a fully asynchronous distributed system when even a single process can fail

• No free lunch

Page 9

Revisiting Our Assumptions

• Drop fully asynchronous requirement

• What about leases?

• Masters obtain, renew a lease

• Shut down if the lease expires (no longer fully asynchronous)

• Assumes only bounded relative clock skew

• Everyone should agree on how fast time elapses

Page 10

Master Failover

• Requires highly available lock / lease system

• Master obtains a lease to be master

• Replicates writes to a backup master

• If master loses lease, hold a new election

• Old master will shut down when lease expires

• If clock skew is bounded, no split-brain! (sketch below)
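
A minimal sketch of that lease discipline in Java. Everything here is illustrative rather than Zettaset's implementation: the lock-service call is a stub, and the key point is that expiry is checked against the local monotonic clock, so the old master fences itself even if it has lost contact with everyone else.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lease-holding master (illustration only): renew periodically,
// and fence ourselves if the lease expires by our own monotonic clock.
public class LeasedMaster {
    private static final long LEASE_MS = 10_000;  // lease length granted by the lock service
    private static final long RENEW_MS = 3_000;   // renew well before expiry

    private final AtomicLong lastRenewNanos = new AtomicLong(System.nanoTime());
    private final ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);

    public void start() {
        // Try to renew the lease with the (stubbed) lock/lease service.
        timer.scheduleAtFixedRate(() -> {
            if (tryRenewLease()) {
                lastRenewNanos.set(System.nanoTime());
            }
        }, 0, RENEW_MS, TimeUnit.MILLISECONDS);

        // Fencing check: if renewal hasn't succeeded within the lease length,
        // stop acting as master. With bounded clock-rate skew, a replacement
        // master is only elected after this shutdown has happened.
        timer.scheduleAtFixedRate(() -> {
            long sinceRenewMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - lastRenewNanos.get());
            if (sinceRenewMs > LEASE_MS) {
                shutdownMasterServices();
            }
        }, RENEW_MS, RENEW_MS, TimeUnit.MILLISECONDS);
    }

    private boolean tryRenewLease() { return true; }          // stub: call the lock service here
    private void shutdownMasterServices() { System.exit(1); } // stop serving before a new master takes over
}
```

In practice the renewal period is kept well below the lease length, so a single missed renewal does not force an unnecessary shutdown.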

Page 11

Failover: Locks/Consensus

• Apache ZooKeeper – Hadoop subproject

• Highly available, filesystem-like service for distributed consensus problems

• Create election, membership, etc. using special-purpose FS semantics

• 'Ephemeral' files disappear when session lease expires

• 'Sequential' files get an auto-incremented suffix (election sketch below)
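
As an illustration of how those two flags combine into an election, here is a bare-bones sketch using the ZooKeeper Java client. The /election path is made up, the parent znode is assumed to exist already, and reconnection, error handling, and watching the predecessor node are left out.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Bare-bones leader election with ephemeral + sequential znodes.
// Assumes the parent znode /election already exists.
public class ElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });

        // Ephemeral: the node vanishes if our session lease expires.
        // Sequential: ZK appends an increasing suffix, e.g. candidate-0000000042.
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);  // lowest sequence number wins the election

        boolean leader = me.endsWith(candidates.get(0));
        System.out.println(leader ? "Acting as master" : "Standing by as backup");
        // A real implementation watches the next-lowest candidate and re-checks
        // leadership when that znode disappears.
    }
}
```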

Page 12

ZooKeeper Internals

• ZooKeeper consists of a quorum of nodes (typically 3-9)

• Majority vote elects a leader (via leases)

• Leader proposes all FS modifications

• Majority must approve a modification for it to be committed

Page 13

Example: HBase

• Apache HBase has fully automated multi-master failover

• Prospective masters register in ZooKeeper

• ZooKeeper ephemeral/sequential files used for election

• Clients look up the current master address in ZooKeeper (lookup sketch below)

• Failover fully automated

• All files stored on HDFS, so no replication issues
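
A rough sketch of that lookup step with the plain ZooKeeper client. The znode path and host:port payload are simplifications for illustration; HBase uses its own znode layout and serialization for the master location. The idea is only that the master's address lives in ZK rather than in client configuration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch: a client discovering the current master through ZooKeeper instead of
// a fixed hostname.
public class MasterLookupSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });

        // watch=true: we are notified when the znode changes, i.e. when a
        // failover installs a new master address.
        byte[] data = zk.getData("/example/master-address", true, new Stat());
        String masterAddress = new String(data, StandardCharsets.UTF_8);

        System.out.println("Current master: " + masterAddress);
        // On the watch event, re-read the znode and reconnect to the new master.
    }
}
```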

Page 14

Failover: Replication

• HBase approach avoids replication issues with HDFS

• Kerberos, NN, Oozie, etc can't use HDFS

• Legacy compatibility (and, for the NameNode, circular dependencies)

• How can we add synchronous write replication?

• Can't break compatibility or change apps

Page 15

Failover: Networking

• HBase avoids networking failover by storing master address in ZK

• Legacy services use IP or hostnames, not ZK, to connect to master

• Out-of-trunk patches to make ZK a DNS server

• But Java caches DNS lookups and doesn't respect DNS TTLs by default, making the maximum failover time hard to bound (see the sketch below)

• DNS introduces its own issues anyway...
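
On the Java caching point above: positive lookups are cached according to the networkaddress.cache.ttl security property rather than the record's DNS TTL (for an implementation-specific period by default, and indefinitely when a security manager is installed), so DNS-based failover also requires bounding that cache explicitly. A minimal example, with illustrative values:

```java
import java.security.Security;

// Bounding the JVM's DNS cache so a changed record is noticed within a known time.
// These are standard security properties; the 5-second values are only examples.
public class DnsCacheConfig {
    public static void main(String[] args) {
        // Must be set before the first name lookup in the JVM; it can also be
        // configured in the JRE's java.security file.
        Security.setProperty("networkaddress.cache.ttl", "5");           // positive lookups: 5 s
        Security.setProperty("networkaddress.cache.negative.ttl", "5");  // failed lookups: 5 s
    }
}
```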

Page 17

IP Failover

• Instead, you can fail over IP addresses

• Virtual IPs – if supported by router

• Otherwise, dynamically update routes as part of your failover

• New leader updates routing tables.

• For local area networks, ensure ARP tables are updated

• Send gratuitous ARP, or store ARP information in ZK (sketch below)
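
For illustration, a bare-bones Java sketch of the network step a newly elected leader might run on Linux: claim the virtual IP, then announce it with gratuitous ARP so LAN neighbours refresh their caches. The interface name, address, and the iputils arping flags are assumptions, not a prescribed setup.

```java
// Claim the virtual IP and announce it with unsolicited (gratuitous) ARP.
// Interface, address, and arping flags are illustrative assumptions.
public class VipTakeoverSketch {
    public static void main(String[] args) throws Exception {
        String iface = "eth0";
        String vip = "10.0.0.50";

        // 1. Attach the virtual IP to this host.
        run("ip", "addr", "add", vip + "/24", "dev", iface);

        // 2. Gratuitous ARP: map the VIP to this host's MAC on the local
        //    network without waiting for neighbours' cache entries to expire.
        run("arping", "-U", "-c", "3", "-I", iface, vip);
    }

    private static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("command failed: " + String.join(" ", cmd));
        }
    }
}
```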

Page 18

Putting it all together

• Consensus/Election

• Use ZooKeeper, 3-9 node quorum

• State Replication

• Small data in ZK, large data in HDFS

• If neither possible, DRBD

• Network Failover

• Store master address in ZK

• Or, perform IP failover

• Dynamically update routing tables and the ARP cache

Page 19

Conclusion

• Fully automated failover is possible

• Design for synchronous replication

• Prevent split-brain

• Manage legacy compatibility

• Coming to Hadoop

• Zettaset provides fully HA Hadoop