Airflow Clustering and High Availability

By: Robert Sanders

Agenda

• Airflow Daemons
• Single Node Deployment
• Cluster Deployment
• Scaling
  • Worker Nodes
  • Master Nodes
• Limitations
• Airflow Scheduler Failover Controller
• Failover Controller Procedure

Airflow Daemons

• Web Server
  • Daemon that runs the Airflow Webserver
  • 1 to many gunicorn processes to accept and process requests in parallel
  • Allows you to track job progress, run jobs and more
• Scheduler
  • Periodically runs (every X seconds) to determine whether a DAG or Task needs to be run based on the DAG schedule
  • Pushes messages to the Queuing Service to be executed
• Worker
  • Daemon that runs if you’re using the CeleryExecutor (as opposed to the SequentialExecutor or LocalExecutor)
  • 1 to many dedicated celeryd processes which execute functions
  • Pulls messages from the Queuing Service to determine what functions to execute
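The Scheduler-to-Worker hand-off can be pictured as a producer and several consumers sharing a queue. The sketch below is only an illustration of that pattern using Python's standard library: `queue.Queue` stands in for the Queuing Service, threads stand in for celeryd processes, and `run_demo` and its parameters are hypothetical names, not part of Airflow or Celery.

```python
import queue
import threading


def run_demo(task_names, num_workers=2):
    """Push task names onto a queue and let worker threads pull and record them."""
    broker = queue.Queue()      # stand-in for the Queuing Service (assumption)
    results = []                # (worker_id, task) pairs, in completion order
    lock = threading.Lock()

    def worker(worker_id):
        # Each worker keeps pulling messages until it sees the stop sentinel.
        while True:
            task = broker.get()
            if task is None:
                broker.task_done()
                return
            with lock:
                results.append((worker_id, task))
            broker.task_done()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()

    # "Scheduler" side: push one message per task, then one sentinel per worker.
    for name in task_names:
        broker.put(name)
    for _ in threads:
        broker.put(None)

    broker.join()
    for t in threads:
        t.join()
    return results
```

Which worker picks up which task is not deterministic; the only guarantee in this sketch is that every task is pulled exactly once.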

Single Node Deployment

Cluster Deployment

Why set up a Cluster Deployment?

• Distributes heavy processes onto many machines for better use of resources
• More Highly Available Airflow environment
• If you have many Workflows with many Tasks, your executors may not be able to keep up with all the messages in the queue. Adding more executors addresses this.

Scaling Workers

• Horizontally
  • Add more machines to the cluster
  • No need to register the machines with the master. You just need to start the Airflow Worker daemon on the new machine.
• Vertically
  • Increase the number of executors (celeryd processes) per node and restart the workers

Scaling Master

Limitations

• There can only be one Scheduler running at a time
  • If you have multiple Scheduler processes running, there's a possibility that multiple instances of a single task will be scheduled to run.
• If the Scheduler daemon, or the machine it runs on, goes down, then no jobs will get scheduled

Airflow Scheduler Failover Controller

• Dedicated daemon that runs with Airflow on the Master Nodes
• Ensures that there is always one and only one Scheduler running on the Master Nodes at a time
• Developed internally and open sourced
  • https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
• High Level Steps
  • Polls (every X seconds) to check if the Scheduler is running
  • If the Scheduler isn’t running, restart the Scheduler
  • If it still doesn’t start up, then try starting it up on the other Master Nodes
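The high level steps above can be sketched as a single polling pass. This is a minimal illustration under stated assumptions, not the code from the linked repository: `ensure_scheduler_running` is a hypothetical name, and `is_running` and `start` are caller-supplied callables standing in for whatever mechanism actually checks and launches the Scheduler on a node.

```python
def ensure_scheduler_running(nodes, current, is_running, start):
    """Run one polling pass and return the node now hosting the Scheduler.

    nodes      -- all Master Node names (hypothetical identifiers)
    current    -- the node the Scheduler is expected to be running on
    is_running -- callable(node) -> bool, assumed health check
    start      -- callable(node), assumed way to launch the Scheduler
    Returns None if the Scheduler could not be started anywhere.
    """
    if is_running(current):
        return current

    # Try a restart on the current node first, then fall back to the others.
    candidates = [current] + [n for n in nodes if n != current]
    for node in candidates:
        start(node)
        if is_running(node):
            return node
    return None
```

A daemon built on this would call the function every X seconds and store the returned node as the new `current`. Stopping at the first node that reports a running Scheduler is what keeps the sketch from starting a second one.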

Failover Controller Diagram

Start Up Scenario

Failover Controller Process (Start Up)

• On startup, the Failover Controller processes start out in STANDBY.
• The first one to enter data into the Metastore is elected as the active controller.
• The active Failover Controller checks to see if the Scheduler is running, but it isn’t.
• The Failover Controller starts up the Scheduler.
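One way to picture "the first one to enter data into the Metastore is elected" is a table whose primary key allows only one row, so that only the first insert can succeed. The sketch below uses an in-memory SQLite database purely for illustration; the table name `active_controller`, its columns, and both function names are assumptions, and the slides do not describe the actual Metastore schema.

```python
import sqlite3


def make_metastore():
    """Create an in-memory stand-in for the Metastore (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE active_controller (role TEXT PRIMARY KEY, node TEXT NOT NULL)"
    )
    return conn


def try_become_active(conn, node_name):
    """Attempt to claim the active role; return True only for the first caller."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO active_controller (role, node) VALUES ('active', ?)",
                (node_name,),
            )
        return True
    except sqlite3.IntegrityError:
        # Another controller already inserted the 'active' row; stay in STANDBY.
        return False
```

Relying on the database's uniqueness constraint means the controllers never have to coordinate with each other directly; whichever insert lands first wins.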

Scheduler Failure Scenario

Failover Controller Process (Process Failure)

• The Scheduler process has died.
• The Failover Controller restarts the Scheduler.

Scheduler Failure and Failed Restart Scenario

Failover Controller Process (Process Failure 2)

• The Scheduler process has died.
• The Failover Controller tries to restart the Scheduler, but it’s still not running.
• The Failover Controller tries to restart the Scheduler on a different node.
• The Failover Controller succeeds in restarting the Scheduler, and the cluster is back to normal.

Node Failure Scenario

Failover Controller Process (Node Failure)

• Everything is running as expected.
• Master Node 1 dies and all the processes running on it are gone.
• The Failover Controller on Master Node 2 becomes active because the one running on Master Node 1 has stopped sending a heartbeat.
• The newly active Failover Controller tries to check in with and restart the Scheduler on the node the metadata says it’s running on, and fails.
• The Failover Controller then starts the Scheduler on another node, and it succeeds.
• When Master Node 1 is brought back, the old Failover Controller goes into STANDBY state.
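The heartbeat-based takeover in this scenario can be sketched as a comparison between the last recorded heartbeat and a timeout. This is an illustrative sketch only: `ControllerRecord`, `next_active`, and the `timeout_seconds` parameter are hypothetical names, and the slides do not state how the heartbeat is stored or what timeout is used.

```python
from dataclasses import dataclass


@dataclass
class ControllerRecord:
    active_node: str        # node whose Failover Controller is currently active
    last_heartbeat: float   # time of the active controller's last heartbeat, in seconds


def next_active(record, my_node, now, timeout_seconds):
    """Return the record after the controller on `my_node` checks in at time `now`."""
    if record.active_node == my_node:
        # The active controller simply refreshes its own heartbeat.
        return ControllerRecord(my_node, now)
    if now - record.last_heartbeat > timeout_seconds:
        # The active controller has gone quiet for too long: the standby takes over.
        return ControllerRecord(my_node, now)
    # The active controller is still healthy: remain in STANDBY.
    return record
```

With this rule, a controller that comes back after its node is restored finds a fresh heartbeat from the new active controller and therefore stays in STANDBY, which matches the last step above.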
