Kubernetes Operator in Airflow is not sharing the load across nodes. Why?
I have Airflow 1.10.5 on a Kubernetes cluster.
The DAGs are written with the Kubernetes operator, so on execution they spin up a pod on the k8s cluster for each task in the DAG.
I have 10 worker nodes.
The pods created by Airflow all land on the same node where Airflow itself is running. When many pods have to spin up, they all queue on that one node, and many of them fail because the node runs out of resources.
At the same time, the other 9 nodes are barely used, since the heavy load comes only from the Airflow jobs.
How can I make Airflow use all the worker nodes of the k8s cluster?
I do not use any node affinity or node selector.
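For context, a task built with the KubernetesPodOperator in Airflow 1.10.x looks roughly like the sketch below; the DAG id, namespace, and image are placeholders rather than values from the question. Without an explicit affinity or node_selectors argument (the knobs the question mentions not using), the Kubernetes scheduler is free to place every task pod on the same node.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Minimal DAG in the style described in the question: every task runs in its own pod.
dag = DAG(
    dag_id="k8s_pod_operator_example",   # placeholder name
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

run_in_pod = KubernetesPodOperator(
    task_id="run_in_pod",
    name="run-in-pod",
    namespace="airflow",              # placeholder namespace
    image="python:3.7-slim",          # placeholder image
    cmds=["python", "-c"],
    arguments=["print('hello from a task pod')"],
    get_logs=True,
    dag=dag,
    # No affinity / node_selectors set, matching the question's setup:
    # the scheduler may keep placing these pods on whichever node has free capacity.
)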
Solution 1:[1]
Solved this little 'issue' by attaching an affinity to the worker pods manually in the Helm chart, as suggested by the document: Airflow Helm-Chart.
workers:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                component: worker
            topologyKey: kubernetes.io/hostname
          weight: 100
The Airflow Helm Chart's values.yaml defines this affinity and says that it will be the default for workers.
affinity: {}
# default worker affinity is:
#  podAntiAffinity:
#    preferredDuringSchedulingIgnoredDuringExecution:
#    - podAffinityTerm:
#        labelSelector:
#          matchLabels:
#            component: worker
#        topologyKey: kubernetes.io/hostname
#      weight: 100
But it fails to mention that this does not apply to the worker pods themselves; it only applies to the worker Deployment used under the CeleryExecutor or CeleryKubernetesExecutor, in worker-deployment.yaml:
...
################################
## Airflow Worker Deployment
#################################
{{- $persistence := .Values.workers.persistence.enabled }}
{{- if or (eq .Values.executor "CeleryExecutor") (eq .Values.executor "CeleryKubernetesExecutor") }}
...
So if you do want to spread out your worker pods more, you need to add this affinity (or another custom affinity) to your worker pod template, which can be done through the Helm values.yaml.
Though I don't think this would be considered an 'issue': most likely that particular node is free enough, so Kubernetes keeps scheduling worker pods onto it. When the system load gets high, Kubernetes will spread the worker pods out. And having worker pods on the same node might reduce network traffic between nodes in some cases.
But in my case, when all worker pods are scheduled onto the same node, pod initialization latency is higher than with a distributed workload, so I decided to spread them out across the cluster.
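For the KubernetesPodOperator case from the question, a similar preferred pod anti-affinity can also be attached per task through the operator's affinity and labels arguments. The sketch below mirrors the Helm values above; the component: worker label is just an assumed label shared by the task pods, and any common label would do.

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Kubernetes-API-style affinity dict, as accepted by the operator.
affinity_across_nodes = {
    "podAntiAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 100,
                "podAffinityTerm": {
                    "labelSelector": {
                        "matchLabels": {"component": "worker"}  # assumed shared label
                    },
                    "topologyKey": "kubernetes.io/hostname",
                },
            }
        ]
    }
}

spread_task = KubernetesPodOperator(
    task_id="spread_task",
    name="spread-task",
    namespace="airflow",               # placeholder namespace
    image="python:3.7-slim",           # placeholder image
    cmds=["python", "-c"],
    arguments=["print('spread across nodes')"],
    labels={"component": "worker"},    # label the pod so the anti-affinity can match it
    affinity=affinity_across_nodes,
    get_logs=True,
    dag=dag,                           # the DAG object from the earlier sketch
)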
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Yi Wu |