In this article, you configure and deploy Apache Airflow on Azure Kubernetes Service (AKS) using Helm.
Configure workload identity
Create a namespace for the Airflow cluster using the `kubectl create namespace` command.

```bash
kubectl create namespace ${AKS_AIRFLOW_NAMESPACE} --dry-run=client --output yaml | kubectl apply -f -
```

Example output:

```output
namespace/airflow created
```
Create a service account and configure workload identity using the `kubectl apply` command.

```bash
export TENANT_ID=$(az account show --query tenantId -o tsv)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    azure.workload.identity/client-id: "${MY_IDENTITY_NAME_CLIENT_ID}"
    azure.workload.identity/tenant-id: "${TENANT_ID}"
  name: "${SERVICE_ACCOUNT_NAME}"
  namespace: "${AKS_AIRFLOW_NAMESPACE}"
EOF
```

Example output:

```output
serviceaccount/airflow created
```
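Optionally, verify that the service account carries the workload identity annotations. This check isn't part of the original walkthrough; it only reads back what the previous step created.

```bash
# Confirm the client-id and tenant-id annotations were applied
kubectl get serviceaccount ${SERVICE_ACCOUNT_NAME} --namespace ${AKS_AIRFLOW_NAMESPACE} --output yaml
```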
Install the External Secrets Operator
In this section, you use Helm to install the External Secrets Operator, a Kubernetes operator that manages the lifecycle of secrets stored in external secret stores such as Azure Key Vault.
Add the External Secrets Helm repository and update the repository using the `helm repo add` and `helm repo update` commands.

```bash
helm repo add external-secrets https://charts.external-secrets.io
helm repo update
```

Example output:

```output
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "external-secrets" chart repository
```
Install the External Secrets Operator using the `helm install` command.

```bash
helm install external-secrets \
  external-secrets/external-secrets \
  --namespace ${AKS_AIRFLOW_NAMESPACE} \
  --create-namespace \
  --set installCRDs=true \
  --wait
```

Example output:

```output
NAME: external-secrets
LAST DEPLOYED: Thu Nov  7 11:16:07 2024
NAMESPACE: airflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
external-secrets has been deployed successfully in namespace airflow!

In order to begin using ExternalSecrets, you will need to set up a SecretStore
or ClusterSecretStore resource (for example, by creating a 'vault' SecretStore).

More information on the different types of SecretStores and how to configure them
can be found in our Github: https://github.com/external-secrets/external-secrets
```
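Before continuing, you can confirm the operator is up. This is a quick sanity check rather than a step from the original walkthrough; it lists the pods in the Airflow namespace, where the external-secrets, cert-controller, and webhook pods should reach a Running status.

```bash
# List the External Secrets Operator pods and check for Running status
kubectl get pods --namespace ${AKS_AIRFLOW_NAMESPACE}
```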
Create secrets
Create a `SecretStore` resource to access the Airflow passwords stored in your key vault using the `kubectl apply` command.

```bash
kubectl apply -f - <<EOF
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: azure-store
  namespace: ${AKS_AIRFLOW_NAMESPACE}
spec:
  provider:
    # provider type: azure keyvault
    azurekv:
      authType: WorkloadIdentity
      vaultUrl: "${KEYVAULTURL}"
      serviceAccountRef:
        name: ${SERVICE_ACCOUNT_NAME}
EOF
```

Example output:

```output
secretstore.external-secrets.io/azure-store created
```
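Optionally, you can confirm that the store authenticated against your key vault; `kubectl get secretstore` prints a READY column that should show True once the workload identity handshake succeeds (column names can vary slightly by operator version).

```bash
# READY should be True; if not, describe the resource to inspect events
kubectl get secretstore azure-store --namespace ${AKS_AIRFLOW_NAMESPACE}
```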
Create an `ExternalSecret` resource, which creates a Kubernetes `Secret` in the `airflow` namespace with the Airflow secrets stored in your key vault, using the `kubectl apply` command.

```bash
kubectl apply -f - <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: airflow-aks-azure-logs-secrets
  namespace: ${AKS_AIRFLOW_NAMESPACE}
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore
    name: azure-store
  target:
    name: ${AKS_AIRFLOW_LOGS_STORAGE_SECRET_NAME}
    creationPolicy: Owner
  data:
    # name of the SECRET in the Azure key vault (no prefix is by default a SECRET)
    - secretKey: azurestorageaccountname
      remoteRef:
        key: AKS-AIRFLOW-LOGS-STORAGE-ACCOUNT-NAME
    - secretKey: azurestorageaccountkey
      remoteRef:
        key: AKS-AIRFLOW-LOGS-STORAGE-ACCOUNT-KEY
EOF
```

Example output:

```output
externalsecret.external-secrets.io/airflow-aks-azure-logs-secrets created
```
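To verify the sync, you can check both the ExternalSecret status and the Kubernetes Secret it owns. This isn't part of the original walkthrough; it's an optional sanity check using standard kubectl commands.

```bash
# The ExternalSecret should report SecretSynced; the target Secret should exist
kubectl get externalsecret airflow-aks-azure-logs-secrets --namespace ${AKS_AIRFLOW_NAMESPACE}
kubectl get secret ${AKS_AIRFLOW_LOGS_STORAGE_SECRET_NAME} --namespace ${AKS_AIRFLOW_NAMESPACE}
```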
Create a federated credential using the `az identity federated-credential create` command.

```bash
az identity federated-credential create \
  --name external-secret-operator \
  --identity-name ${MY_IDENTITY_NAME} \
  --resource-group ${MY_RESOURCE_GROUP_NAME} \
  --issuer ${OIDC_URL} \
  --subject system:serviceaccount:${AKS_AIRFLOW_NAMESPACE}:${SERVICE_ACCOUNT_NAME} \
  --output table
```

Example output:

```output
Issuer                                                                                                                   Name                      ResourceGroup            Subject
-----------------------------------------------------------------------------------------------------------------------  ------------------------  -----------------------  -------------------------------------
https://$MY_LOCATION.oic.prod-aks.azure.com/c2c2c2c2-dddd-eeee-ffff-a3a3a3a3a3a3/aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e/   external-secret-operator  $MY_RESOURCE_GROUP_NAME  system:serviceaccount:airflow:airflow
```
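If you want to confirm the credential registered as expected, `az identity federated-credential list` shows the issuer and subject you just configured.

```bash
# List federated credentials on the user-assigned identity
az identity federated-credential list \
  --identity-name ${MY_IDENTITY_NAME} \
  --resource-group ${MY_RESOURCE_GROUP_NAME} \
  --output table
```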
Give permission to the user-assigned identity to access the secret using the `az keyvault set-policy` command.

```bash
az keyvault set-policy --name $MY_KEYVAULT_NAME --object-id $MY_IDENTITY_NAME_PRINCIPAL_ID --secret-permissions get --output table
```

Example output:

```output
Location      Name               ResourceGroup
------------  -----------------  -----------------------
$MY_LOCATION  $MY_KEYVAULT_NAME  $MY_RESOURCE_GROUP_NAME
```
Create a persistent volume for Apache Airflow logs
Create a persistent volume using the `kubectl apply` command.

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-airflow-logs
  labels:
    type: local
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain # If set to "Delete", the container is removed after the PVC is deleted
  storageClassName: azureblob-fuse-premium
  mountOptions:
    - -o allow_other
    - --file-cache-timeout-in-seconds=120
  csi:
    driver: blob.csi.azure.com
    readOnly: false
    volumeHandle: airflow-logs-1
    volumeAttributes:
      resourceGroup: ${MY_RESOURCE_GROUP_NAME}
      storageAccount: ${AKS_AIRFLOW_LOGS_STORAGE_ACCOUNT_NAME}
      containerName: ${AKS_AIRFLOW_LOGS_STORAGE_CONTAINER_NAME}
    nodeStageSecretRef:
      name: ${AKS_AIRFLOW_LOGS_STORAGE_SECRET_NAME}
      namespace: ${AKS_AIRFLOW_NAMESPACE}
EOF
```

Example output:

```output
persistentvolume/pv-airflow-logs created
```
Create a persistent volume claim for Apache Airflow logs
Create a persistent volume claim using the `kubectl apply` command.

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-airflow-logs
  namespace: ${AKS_AIRFLOW_NAMESPACE}
spec:
  storageClassName: azureblob-fuse-premium
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  volumeName: pv-airflow-logs
EOF
```

Example output:

```output
persistentvolumeclaim/pvc-airflow-logs created
```
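You can optionally confirm the claim bound to the static volume before deploying the chart; the STATUS column should read Bound.

```bash
# Verify the PVC bound to pv-airflow-logs
kubectl get pvc pvc-airflow-logs --namespace ${AKS_AIRFLOW_NAMESPACE}
```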
Deploy Apache Airflow using Helm
Configure an `airflow_values.yaml` file to change the default deployment configurations for the chart and update the container registry for the images.

```bash
cat <<EOF > airflow_values.yaml
images:
  airflow:
    repository: $MY_ACR_REGISTRY.azurecr.io/airflow
    tag: 2.9.3
    # Specifying digest takes precedence over tag.
    digest: ~
    pullPolicy: IfNotPresent
  # To avoid images with user code, you can turn this to 'true' and
  # all the 'run-airflow-migrations' and 'wait-for-airflow-migrations' containers/jobs
  # will use the images from 'defaultAirflowRepository:defaultAirflowTag' values
  # to run and wait for DB migrations.
  useDefaultImageForMigration: false
  # timeout (in seconds) for airflow-migrations to complete
  migrationsWaitTimeout: 60
  pod_template:
    # Note that 'images.pod_template.repository' and 'images.pod_template.tag' parameters
    # can be overridden in the 'config.kubernetes' section. So for these parameters to have effect,
    # 'config.kubernetes.worker_container_repository' and 'config.kubernetes.worker_container_tag'
    # must not be set.
    repository: $MY_ACR_REGISTRY.azurecr.io/airflow
    tag: 2.9.3
    pullPolicy: IfNotPresent
  flower:
    repository: $MY_ACR_REGISTRY.azurecr.io/airflow
    tag: 2.9.3
    pullPolicy: IfNotPresent
  statsd:
    repository: $MY_ACR_REGISTRY.azurecr.io/statsd-exporter
    tag: v0.26.1
    pullPolicy: IfNotPresent
  pgbouncer:
    repository: $MY_ACR_REGISTRY.azurecr.io/airflow
    tag: airflow-pgbouncer-2024.01.19-1.21.0
    pullPolicy: IfNotPresent
  pgbouncerExporter:
    repository: $MY_ACR_REGISTRY.azurecr.io/airflow
    tag: airflow-pgbouncer-exporter-2024.06.18-0.17.0
    pullPolicy: IfNotPresent
  gitSync:
    repository: $MY_ACR_REGISTRY.azurecr.io/git-sync
    tag: v4.1.0
    pullPolicy: IfNotPresent

# Airflow executor
executor: "KubernetesExecutor"

# Environment variables for all airflow containers
env:
  - name: ENVIRONMENT
    value: dev

extraEnv: |
  - name: AIRFLOW__CORE__DEFAULT_TIMEZONE
    value: 'Asia/Shanghai'

# Configuration for postgresql subchart
# Not recommended for production! Instead, spin up your own Postgresql server and use the 'data' attribute in this
# yaml file.
postgresql:
  enabled: true

# Enable pgbouncer. See https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#pgbouncer
pgbouncer:
  enabled: true

dags:
  gitSync:
    enabled: true
    repo: https://github.com/donhighmsft/airflowexamples.git
    branch: main
    rev: HEAD
    depth: 1
    maxFailures: 0
    subPath: "dags"
    # sshKeySecret: airflow-git-ssh-secret
    # knownHosts: |
    #   github.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=

logs:
  persistence:
    enabled: true
    existingClaim: pvc-airflow-logs
    storageClassName: azureblob-fuse-premium

# We disable the log groomer sidecar because we use Azure Blob Storage for logs, with lifecycle policy set.
triggerer:
  logGroomerSidecar:
    enabled: false

scheduler:
  logGroomerSidecar:
    enabled: false

workers:
  logGroomerSidecar:
    enabled: false
EOF
```
Add the Apache Airflow Helm repository and update the repository using the `helm repo add` and `helm repo update` commands.

```bash
helm repo add apache-airflow https://airflow.apache.org
helm repo update
```

Example output:

```output
"apache-airflow" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "apache-airflow" chart repository
```
Search the Helm repository for the Apache Airflow chart using the `helm search repo` command.

```bash
helm search repo airflow
```

Example output:

```output
NAME                    CHART VERSION   APP VERSION     DESCRIPTION
apache-airflow/airflow  1.15.0          2.9.3           The official Helm chart to deploy Apache Airflo...
```
Install the Apache Airflow chart using the `helm install` command.

```bash
helm install airflow apache-airflow/airflow --namespace airflow --create-namespace -f airflow_values.yaml --debug
```

Example output:

```output
NAME: airflow
LAST DEPLOYED: Fri Nov  8 11:59:43 2024
NAMESPACE: airflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing Apache Airflow 2.9.3!

Your release is named airflow.

You can now access your dashboard(s) by executing the following command(s) and visiting the corresponding port at localhost in your browser:

Airflow Webserver:     kubectl port-forward svc/airflow-webserver 8080:8080 --namespace airflow

Default Webserver (Airflow UI) Login credentials:
    username: admin
    password: admin

Default Postgres connection credentials:
    username: postgres
    password: postgres
    port: 5432

You can get Fernet Key value by running the following:

    echo Fernet Key: $(kubectl get secret --namespace airflow airflow-fernet-key -o jsonpath="{.data.fernet-key}" | base64 --decode)

###########################################################
#  WARNING: You should set a static webserver secret key  #
###########################################################

You are using a dynamically generated webserver secret key, which can lead to unnecessary restarts of your Airflow components.

Information on how to set a static webserver secret key can be found here:
https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#webserver-secret-key
```
Verify the installation using the `kubectl get pods` command.

```bash
kubectl get pods -n airflow
```

Example output:

```output
NAME                                                READY   STATUS      RESTARTS   AGE
airflow-create-user-kklqf                           1/1     Running     0          12s
airflow-pgbouncer-d7bf9f649-25fnt                   2/2     Running     0          61s
airflow-postgresql-0                                1/1     Running     0          61s
airflow-run-airflow-migrations-zns2b                0/1     Completed   0          60s
airflow-scheduler-5c45c6dbdd-7t6hv                  1/2     Running     0          61s
airflow-statsd-6df8564664-6rbw8                     1/1     Running     0          61s
airflow-triggerer-0                                 2/2     Running     0          61s
airflow-webserver-7df76f944c-vcd5s                  0/1     Running     0          61s
external-secrets-748f44c8b8-w7qrk                   1/1     Running     0          3h6m
external-secrets-cert-controller-57b9f4cb7c-vl4m8   1/1     Running     0          3h6m
external-secrets-webhook-5954b69786-69rlp           1/1     Running     0          3h6m
```
Access Airflow UI
Securely access the Airflow UI through port forwarding using the `kubectl port-forward` command.

```bash
kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow
```

Open your browser and navigate to `localhost:8080` to access the Airflow UI. Use the default webserver URL and login credentials provided during the Airflow Helm chart installation to log in.
Explore and manage your workflows securely through the Airflow UI.
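As an optional check that isn't part of the original walkthrough, you can confirm the default admin account exists by running the Airflow CLI inside the webserver pod; the `airflow-webserver` deployment name matches the chart's default naming for a release named airflow.

```bash
# List Airflow UI users; the default install creates an 'admin' user
kubectl exec --namespace airflow deploy/airflow-webserver -- airflow users list
```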
Integrate Git with Airflow
Integrating Git with Apache Airflow enables seamless version control and streamlined management of your workflow definitions, ensuring that all DAGs are both organized and easily auditable.
- Set up a Git repository for DAGs. Create a dedicated Git repository to house all your Airflow DAG definitions. This repository serves as the central source of truth for your workflows, allowing you to manage, track, and collaborate on DAGs effectively.
- Configure Airflow to sync DAGs from Git. Update Airflow's configuration to automatically pull DAGs from your Git repository by setting the Git repository URL and any required authentication credentials directly in Airflow's configuration files or through Helm chart values (see the sketch below). This setup enables automated synchronization of DAGs, ensuring that Airflow is always up to date with the latest version of your workflows.
This integration enhances the development and deployment workflow by introducing full version control, enabling rollbacks, and supporting team collaboration in a production-grade setup.
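For reference, here's a minimal sketch of the `dags.gitSync` values the chart accepts, mirroring the settings used in `airflow_values.yaml` earlier; the repository URL and branch are placeholders to replace with your own.

```yaml
dags:
  gitSync:
    enabled: true
    repo: https://github.com/<your-org>/<your-dags-repo>.git  # placeholder: your DAG repository
    branch: main
    subPath: "dags"  # folder within the repo that contains your DAG files
    # For private repositories, reference a Kubernetes secret holding an SSH key:
    # sshKeySecret: airflow-git-ssh-secret
```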
Make your Airflow on Kubernetes production-grade
The following best practices can help you make your Apache Airflow on Kubernetes deployment production-grade (a sample values sketch follows the list):
- Ensure you have a robust setup focused on scalability, security, and reliability.
- Use dedicated, autoscaling nodes, and select a resilient executor like KubernetesExecutor, CeleryExecutor, or CeleryKubernetesExecutor.
- Use a managed, high-availability database back end like MySQL or PostgreSQL.
- Establish comprehensive monitoring and centralized logging to maintain performance insights.
- Secure your environment with network policies, SSL, and Role-Based Access Control (RBAC), and configure Airflow components (Scheduler, Web Server, Workers) for high availability.
- Implement CI/CD pipelines for smooth DAG deployment, and set up regular backups for disaster recovery.
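As one illustration of the high-availability and managed-database recommendations above, here's a minimal, hedged sketch of chart values; `<your-postgres-host>` and `<your-password>` are placeholders, and in production the connection details belong in a secret rather than plain values.

```yaml
# Run redundant schedulers and webservers for high availability.
scheduler:
  replicas: 2
webserver:
  replicas: 2

# Disable the bundled PostgreSQL subchart and point Airflow at a managed server.
postgresql:
  enabled: false
data:
  metadataConnection:
    user: airflow
    pass: <your-password>       # placeholder: use a secret in production
    protocol: postgresql
    host: <your-postgres-host>  # placeholder: managed PostgreSQL FQDN
    port: 5432
    db: airflow
```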
Next steps
To learn more about deploying open-source software on Azure Kubernetes Service (AKS), see the following articles:
- Deploy a MongoDB cluster on Azure Kubernetes Service (AKS)
- Deploy a highly available PostgreSQL database on Azure Kubernetes Service (AKS)
- Deploy a Valkey cluster on Azure Kubernetes Service (AKS)
Contributors
Microsoft maintains this article. The following contributors originally wrote it:
- Don High | Principal Customer Engineer
- Satya Chandragiri | Senior Digital Cloud Solution Architect
- Erin Schaffer | Content Developer 2