

Upgrading to Airflow 2.0: Massive Performance Wins and Lessons Learned

At RealScout, we use Apache Airflow to orchestrate our crucial workflows such as data ingestions, health checks, and reconciliation processes. Our team recently upgraded our stack from Airflow 1.10.10 to the long-awaited 2.0. We are now about two weeks into the new Airflow deployment in production and are already seeing some very big wins, which we'll detail in later posts in this series. This upgrade was unique from previous Airflow upgrades due to both the volume and type of changes included in 2.0. In this post, we highlight our experiences, the challenges we encountered, and advice for others who may be planning their own upgrade.

Our existing Airflow stack is deployed on AWS, using the great Turbine project as a starting point for our CloudFormation template. For those unfamiliar, a CloudFormation template defines all resources and policies for a set of AWS infrastructure. Our production stack includes 1 Scheduler instance, 1 Web Server instance, 10 Celery Worker instances, and 1 RDS PostgreSQL instance.

Aside from the instances, other important resources include SQS queues (used for Airflow's Celery setup with SQS as a backend), an Auto Scaling Group for maintaining the desired number of worker instances, security groups to limit access to the instances and the database, etc. One important consideration is that all of our Airflow instances (scheduler, web server, and workers) share a common /efs folder mount for Airflow code and logs, which makes deployments and logging simpler.

AIRFLOW 2.0 CODE

[Image: Simplified version of our Airflow architecture at RealScout]

We have about 1000 active DAGs in total, a number that has grown over the years as we've become more accustomed to (and trusting of) Airflow. The DAGs run on various schedules; however, most of them run every 15 minutes or less. Some maintenance or more intensive DAGs run once a day or once a week.

Perhaps unique to our setup is that almost all of our DAGs are 1–2 tasks, the bulk of which just orchestrate and monitor ECS Tasks. This lets us utilize very low-powered EC2 Airflow worker instances, which is great for cost saving. This is probably different from many other teams' typical setups, which may have DAGs containing hundreds or even thousands of tasks in some cases. We'll discuss the implications of these 'shallow' and 'deep' DAGs (as we call them) later, but know that there are important Airflow Scheduler throughput considerations depending on your reality.

At RealScout, we aim for predictable, 'boring' migrations that minimize downtime. Taking as much as we could into consideration, we came up with the following plan of action to upgrade Airflow.

AIRFLOW 2.0 UPGRADE

Weeks before our planned migration date, we created a cloned stack in AWS (same Airflow 1.10.10 setup) for testing and validation. While our plan has many specific references to our AWS setup, the spirit of it should apply to any Airflow upgrade. As mentioned earlier, we use CloudFormation, so spinning up a new stack based on our existing infrastructure configurations was trivial.

Create a snapshot of your production Airflow database in RDS; we will use it to test the 'airflow db upgrade' command in staging.

Stop the staging scheduler, web server, and worker instances. We recommend using the exact same instance sizes (CPU, memory, etc.) for the best comparison. Since we use an auto scaling group to manage our worker instances, we simply scaled its max capacity to 0.

'Restore' your production RDS snapshot into your staging RDS instance. Because this is our staging environment and a fresh Airflow database, no DAGs are active.

Check out your branch with your new dependencies, and 'pip install' new packages on the scheduler and web server instances. Again, your staging scheduler is already disabled, so no DAG runs will be triggered.
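The staging rehearsal described in our upgrade plan can be sketched roughly as the shell sequence below. This is a dry-run sketch, not our actual deployment scripts: every identifier (database names, snapshot ID, auto scaling group name, Airflow version pin) is a placeholder, and the `run` helper only prints each command so the sequence can be reviewed before anything is executed for real.

```shell
# Dry-run helper: print each command instead of executing it.
run() { echo "+ $*"; }

# Placeholder identifiers -- substitute your own.
PROD_DB="airflow-prod"
STAGING_DB="airflow-staging"
SNAPSHOT_ID="airflow-prod-pre-2-0-test"
WORKER_ASG="airflow-staging-workers"

# 1. Snapshot the production Airflow metadata database.
run aws rds create-db-snapshot \
  --db-instance-identifier "$PROD_DB" \
  --db-snapshot-identifier "$SNAPSHOT_ID"

# 2. Stop staging workers by zeroing out the auto scaling group.
run aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name "$WORKER_ASG" \
  --min-size 0 --max-size 0 --desired-capacity 0

# 3. "Restore" the production snapshot -- note that RDS creates a
#    new DB instance from a snapshot rather than overwriting one.
run aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$STAGING_DB" \
  --db-snapshot-identifier "$SNAPSHOT_ID"

# 4. Install the new dependencies, then test the schema migration
#    against the restored production data.
run pip install "apache-airflow==2.0.0"
run airflow db upgrade
```

Timing step 4 against a restored production snapshot (on comparably sized hardware) gives a realistic estimate of how long the `airflow db upgrade` migration will take during the real cutover.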
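The shared /efs mount mentioned in the architecture description can be sketched as follows. This is a minimal illustration, assuming a standard EFS-over-NFS mount: the filesystem ID, region, mount options, and folder paths are all hypothetical, and the `run` helper only prints the command rather than executing it.

```shell
# Dry-run helper: print the command instead of executing it.
run() { echo "+ $*"; }

# Every instance (scheduler, web server, workers) mounts the same EFS
# filesystem at /efs, so DAG code deployed once is visible everywhere
# and logs land in a single shared location.
run sudo mkdir -p /efs
run sudo mount -t nfs4 -o nfsvers=4.1 \
  fs-12345678.efs.us-east-1.amazonaws.com:/ /efs

# airflow.cfg on each instance would then point at the shared mount,
# e.g. (illustrative paths):
#   [core]
#   dags_folder = /efs/airflow/dags
#   [logging]
#   base_log_folder = /efs/airflow/logs
```

One consequence of this layout is that a deploy is just a code checkout on the shared mount, which is why the plan's "check out your branch" step needs no per-worker copy.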
