---
layout: post
title: "Airflow upgrade from 2.2.0 to 2.5.3 journey"
author: dimonchik-suvorov
tags:
- featured
- airflow-series
- aws
team: Data Platform
---

Let me continue QP's `airflow-series` and tell you the story of one Airflow upgrade. In this post I'm going to describe what infrastructure we had before, how we improved it, and what difficulties we faced along the way. Let's go!

### At the beginning there were...

The main parts of our Airflow infrastructure:
- Old but gold Airflow `2.2.0`
- Our private fork with custom changes that we hadn't upstreamed yet
- An Airflow image based on the `python3.7` Docker image, stored in AWS ECR (Elastic Container Registry)
- Airflow installed in `--editable` mode on every image build to apply our custom changes
- Scheduler and Web Server running on AWS ECS Fargate (Elastic Container Service)
- Terraform for configuring ECS tasks and Airflow params
- Kubernetes Executor on AWS EKS `1.24` (Elastic Kubernetes Service)
- Airflow backend: MySQL `5.7` on RDS (Relational Database Service)

Also, we have a couple of our own Airflow plugins, such as custom operators, hooks and the _Self-Service Backfill UI_ (you can learn more about it in our [Airflow Summit 2021 slides](https://airflowsummit.org/slides/2021/e5_6-ModernizeAirflow-Scribd.pdf)).
On top of that, a Jenkins pipeline builds the Airflow Docker image and applies Terraform to restart the ECS tasks with the new configuration and Airflow image.

### Why?

We decided to upgrade because of newly released features and bug fixes, obviously, but the thing we wanted most was support for multiple Schedulers: our Fargate instance was getting close to its maximum available resources, and we suffered from constantly high CPU utilization.

This pulled in a MySQL upgrade as well, because multiple Schedulers don't work with MySQL 5.7, which lacks the `SELECT ... FOR UPDATE SKIP LOCKED` support the HA scheduler relies on.

On top of all that, we decided it would be a good idea to also bump the Python version from `3.7` to `3.10` (which, as you may already suspect, it wasn't).

### As promised - the journey!

Before anything else we read all the release notes from 2.2.0 up to 2.5.3, simply to understand what we should be expecting, and made a list of the issues we might run into. We thought we were prepared and fearlessly made the first move.

#### Database upgrade

Ahead of the Airflow upgrade we decided to bump our MySQL version from `5.7` to `8` in advance. This didn't cause any trouble because Airflow `2.2.0` works fine with MySQL `8`. Another reason to bump the database version in advance was to reduce the complexity of the Airflow upgrade itself and the time it would take.

#### Testing - yes / no / maybe so?

It's generally a good idea to test everything before upgrading Production. To test all the integrations we created a separate Terraform module that takes our development environment (ECS tasks for the Scheduler and Webserver, a database restored from a snapshot) and creates all the infrastructure Airflow needs right beside Dev, without any intersection with it, so we could test everything without fear of breaking things. The main beauty of this setup is quick recreation: roughly 5 minutes to redeploy everything from scratch in case something goes irreparably wrong. Separately, we also tested some parts locally using Breeze and our own custom Airflow image build.

After plain testing of integrations and individual parts, the next step was performance testing. We didn't enable the multi-scheduler feature yet and wanted to see how the new version behaves. Spoiler: so far it works almost the same. It uses cores and memory a bit more intensively, but the issue we had before is gone. That issue was the old Scheduler using 100% of its CPUs from time to time, with only a restart helping (the big long spike on the screenshot below).

For the performance testing we created an autogenerated DAG with more than 1000 dummy PythonOperators (roughly the number of tasks in our main DAG). Those operators sleep for a random amount of time to exercise concurrency.
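
The test DAG itself was nothing fancy; here is a minimal sketch of that kind of generated load DAG (the DAG id, task count and sleep range are illustrative, not our exact values):

```python
# Minimal sketch of an autogenerated load-test DAG (illustrative values).
import random
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _dummy_work() -> None:
    # Random sleep so task durations are uneven and concurrency actually gets exercised
    time.sleep(random.uniform(1, 60))


with DAG(
    dag_id="performance_test_dag",  # hypothetical id
    start_date=datetime(2023, 1, 1),
    schedule=None,  # `schedule_interval=None` on the old 2.2 deployment
    catchup=False,
) as dag:
    for i in range(1000):
        PythonOperator(task_id=f"dummy_task_{i}", python_callable=_dummy_work)
```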

![](/post-images/2023-06-07-airflow-upgrade-to-2-5-3/ecs-performance-metrics.png)
<font size="3"><center><i>Scheduler ECS performance metrics</i></center></font>

#### Updating code base

We started by merging `2.5-stable` into our main fork branch. After all the merge conflicts were resolved (the easiest part of the upgrade, I'd say) we started testing...

This wasn't exactly what you'd call a smooth upgrade: switching to the new Airflow version caused a lot of errors in our plugins and in services that rely heavily on the Airflow core/provider code.

Issues we knew about and/or caught during testing (a small before/after sketch follows the list):
- our custom DAG-dump code: the internal structure of the DAG class changed slightly (things like `_BaseOperator__init_kwargs`, `TaskGroup`, `ParamsDict`, etc.), which broke that code
- our custom DAG validations (`DummyOperator` is deprecated in favor of `EmptyOperator`)
- custom operators' `execute` functions, where we were using the TaskInstance `key`, which changed to the DagRun's `run_id`
- some of the `timetable` functions also changed, and we had to find new ways to do what we did before (for example `dag.timetable.infer_data_interval(your_execution_date).end` became `dag.timetable.infer_manual_data_interval(run_after=exec_dt).start`)
- the `node:12.22.6` Docker image we used for building npm assets (`airflow/www/static/dist`) was too old for the new and shiny Airflow (we switched to `node:16.0.0`)
- `TriggerRuleDep` changed, and our custom rules that were using/overriding `_get_dep_statuses` and `_evaluate_trigger_rule` also started to fail
- the Okta integration started to fail because of the new Flask AppBuilder (simply adding `server_metadata_url` to the existing configuration solved the issue)
- some Airflow configuration params were renamed (like `AIRFLOW__CORE__SQL_ALCHEMY_CONN` -> `AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`)
- `AwsLambdaHook` changed to `LambdaHook` in the AWS provider, and the `function_name` parameter moved from the hook's `__init__` to the `invoke_lambda` function (closer to the execution)
- BaseOperator's `task_concurrency` parameter changed to `max_active_tis_per_dag`
- the `ResultProxy` and `RowProxy` classes from `sqlalchemy.engine.result` were renamed to `Result` and `Row`
- the DAG's `schedule_interval` param changed to `schedule`
- our unit tests also broke, because we mocked TaskInstance and other classes whose internal structure changed
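
Most of these were mechanical renames. To make them a bit more concrete, here is a minimal before/after-style sketch of the DAG-level changes (the DAG id and callable are made up for illustration; the old names are shown in the comments):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # was airflow.operators.dummy.DummyOperator
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="example_upgraded_dag",  # hypothetical DAG, for illustration only
    start_date=datetime(2023, 1, 1),
    schedule=timedelta(days=1),     # was `schedule_interval=timedelta(days=1)`
) as dag:
    start = EmptyOperator(task_id="start")  # was DummyOperator(task_id="start")

    process = PythonOperator(
        task_id="process",
        python_callable=lambda: print("processing"),
        max_active_tis_per_dag=1,   # was `task_concurrency=1`
    )

    start >> process

# And the timetable call we had to change in our plugins:
#   dag.timetable.infer_data_interval(execution_date).end
# became
#   dag.timetable.infer_manual_data_interval(run_after=exec_dt).start
```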

Wasn't that hard, right? By the way, during the concurrency and performance testing we learned that our Kubernetes executor configuration wasn't optimal: the default `AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE=1` didn't match our needs, so we bumped this param to `50`, and starting new tasks became significantly faster.

#### Apply DB migrations gently

In parallel with testing the code we were testing Airflow's connectivity to the database (the `airflow db check` command) and how the Airflow database migrations apply to our database (`airflow db upgrade`). How to upgrade the Airflow database is described [here](https://airflow.apache.org/docs/apache-airflow/2.5.3/installation/upgrading.html). What we learned:
- some providers changed their packages; the most significant thing was that the Airflow folks introduced the `common.sql` provider and moved some functionality there from the `databricks` provider (which we also use a lot). Because of [Broken installation with --editable flag](https://github.com/apache/airflow/issues/30764) we hit an issue with Airflow's communication with the database. I've described a few workarounds in the ticket, but the root cause was that we upgraded Python (and, as a consequence, the pip version).
- the new Docker image with Python `3.10` came with the new Debian `bullseye`, and there was a change important for us: the `mysql-client` package is no longer shipped and became `default-mysql-client`
- `airflow db upgrade` generated `dangling` tables and put there all the DAGs, TaskInstances and other entities that couldn't be migrated, for example rows without an ID or execution_date. Don't ask me how that happened, but we reviewed these tables and successfully dropped them (a small inspection sketch follows this list)
- applying the migrations took ~2 hours on our weak dev RDS cluster, and we thought it would be faster on production... boy oh boy, were we wrong...
- our plain Airflow `User` role lost access to the custom _Self-Service Backfill UI_. We solved this by setting `AIRFLOW__WEBSERVER__UPDATE_FAB_PERMS` to `False`, because otherwise Airflow drops custom permissions every time the Web Server restarts
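
Reviewing those `dangling` tables was nothing more sophisticated than counting and eyeballing rows. A rough sketch of how that can be done (the connection string is a placeholder, and the exact table-name pattern may differ between Airflow versions; in our case the names contained `dangling`):

```python
# Rough sketch: find and inspect the "dangling" tables the migration produced
# before dropping them. Connection string is a placeholder; the table-name
# pattern may differ between Airflow versions.
from sqlalchemy import create_engine, inspect, text

engine = create_engine("mysql+mysqldb://airflow:***@airflow-rds-host/airflow")

dangling_tables = [t for t in inspect(engine).get_table_names() if "dangling" in t]
with engine.connect() as conn:
    for table in dangling_tables:
        count = conn.execute(text(f"SELECT COUNT(*) FROM {table}")).scalar()
        print(table, count)
# After reviewing the contents we dropped these tables manually (DROP TABLE ...).
```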

#### Ship it!

Finally, everything was tested and fixed. The time had come to upgrade Airflow on production! All DAGs paused, ECS tasks stopped, production database snapshot created.

###### Database migration
We started applying the migrations on the production database. To do this we prepared another ECS task which runs the `airflow db upgrade` command. We successfully deployed it via Terraform and started to wait... and wait... and wait a little bit more...

After ~5 hours we started to wonder why it was taking so long. We realized that:
- our production Airflow database RDS cluster is only twice as big as our development one
- we have a giant database with data going back to 2017 (backfills mostly). Considering we had a `sentinel` DAG which checks every 5 minutes that the Scheduler is alive (it sends a gauge metric to Datadog), and our main DAG with ~1k tasks in it running daily, you can imagine how big the `task_instance` and `dag_run` tables were...
- possibly a missing index in the database

What we decided to do:
- stop the migration
- give our production RDS cluster more resources: 32 GB RAM and 16 CPUs compared to the 8 GB RAM and 4 CPUs we had before
- delete all data older than 2021 (a cleanup sketch follows below)
- rerun the migration ECS task

That helped: the migration completed in ~30 minutes!
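
The deletion step was plain SQL against the metadata database. A rough sketch of that kind of cleanup (the connection string and exact statements are illustrative assumptions, not the script we ran; test it on a snapshot first):

```python
# Rough sketch of trimming old scheduler history before a heavy schema migration.
# Connection string and statements are illustrative; run against a snapshot first.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+mysqldb://airflow:***@airflow-rds-host/airflow")
CUTOFF = "2021-01-01"

with engine.begin() as conn:
    # task_instance rows reference dag_run, so clear them first
    conn.execute(text("DELETE FROM task_instance WHERE start_date < :c"), {"c": CUTOFF})
    conn.execute(text("DELETE FROM dag_run WHERE execution_date < :c"), {"c": CUTOFF})
    # other large history tables (log, xcom, task_fail, ...) can be trimmed the same way
```

Newer Airflow versions also ship an `airflow db clean` command for this kind of metadata cleanup, which is worth looking at before writing SQL by hand.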

###### Airflow - your turn!

Deploying the ECS tasks with the updated Airflow image via Terraform didn't cause any issues.

After that we merged all the PRs with updated DAGs and related services, and checked that there were no DAG parsing errors on production.

We started all the DAGs and began to observe... and things started to fail:
- [Breaking changes](https://github.com/apache/airflow/pull/26452/files#diff-55326294fdc9ce88aee820373ed658972dff4067517a9c4f59819efbbf3e3b85R27) in the SlackWebhookHook. Basically, they refactored the hook and switched to the Slack SDK instead of plain HTTP calls. This caused `SlackRequestError: Invalid URL detected: ***` for all our Slack hook usages. To fix it we had to go through all the Slack Airflow Connections, copy the `webhook_token` extra param, change the connection type to Slack Incoming Webhook and put the copied value into the `Webhook Token` field (a small connection sketch follows this list)
- a bug in the Databricks SQL operator: it creates a new connection for each query. This breaks our templated queries, because in most of them the first query is `use some_database` and the next one is the actual processing query. When a new connection is created for each query in a file, the `use` statement is simply forgotten and the second query fails because it can't find the tables without a database
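
For reference, this is roughly what the re-created Slack connection looks like if you define it in code rather than in the UI (the connection id and token are placeholders, and the field mapping is how we understood the new provider, so double-check it against your provider version):

```python
# Rough sketch of the new-style Slack Incoming Webhook connection
# (connection id and token are placeholders; we actually edited ours in the UI).
from airflow.models.connection import Connection

slack_conn = Connection(
    conn_id="slack_alerts",                 # hypothetical connection id
    conn_type="slackwebhook",               # "Slack Incoming Webhook" connection type
    password="T000/B000/XXXXXXXXXXXXXXXX",  # the former `webhook_token` extra goes here
)
print(slack_conn.get_uri())  # handy for exporting as an AIRFLOW_CONN_* env var
```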

After these issues were fixed we finally got the chance to go to sleep, because this is the end of the story. We made it!

### Conclusions
Even if you think you have tested everything, you can still be wrong. Nevertheless, the upgrade is finished and we now have Airflow 2.5.3 up and running.

Another decision we made: we now use the constraints file during Airflow installation, because without it some Python modules could be pulled in at newer versions during an Airflow Docker image rebuild and cause dependency conflicts.

Clean up your database as much as possible in advance to reduce the number of rows that have to be processed during the database migration.

### Credits
- [Maksym Dovhal](https://github.com/Maks-D) for creating the separate Terraform module for testing, actually testing things, fixing bugs along the way, and for being our deployment commander
- [Kuntal Basu](https://github.com/kuntalkumarbasu), our infrastructure magician and guru, for his help with real-time issue detection and fixing during the deployment
- [Artur Kiiko](https://github.com/arturkii) and [Lakshmi Pernapati](https://github.com/lpernapati) for participating in local/integration testing and fixing issues