In this article, you learn how to install and configure Kueue to schedule batch workloads on an Azure Kubernetes Service (AKS) cluster. You also explore core Kueue concepts, review installation options that enable advanced Kueue features, and learn how to verify your deployment.
Important
Open-source software is mentioned throughout AKS documentation and samples. Software that you deploy is excluded from AKS service-level agreements, limited warranty, and Azure support. As you use open-source technology alongside AKS, consult the support options available from the respective communities and project maintainers to develop a plan.
For example, the Ray GitHub repository describes several platforms that vary in response time, purpose, and support level.
Microsoft takes responsibility for building the open-source packages that we deploy on AKS. That responsibility includes having complete ownership of the build, scan, sign, validate, and hotfix process, along with control over the binaries in container images. For more information, see Vulnerability management for AKS and AKS support coverage.
What are batch workloads?
Batch deployments are typically non-interactive workloads that are retriable, have a finite duration, and might experience spiky or bursty resource usage. These workloads include, but aren't limited to:
- Data processing jobs.
- Security vulnerability scans.
- Media encoding or video transcoding.
- Report generation or financial analysis.
- GPU workloads that require all resources to be available and might tolerate a delayed start but can't tolerate partial GPU allocation.
These workloads are often modeled using a Kubernetes Job, CronJob, or custom resource definition (CRD) like RayJob or Kubeflow MPIJob. Batch deployments have requirements that differ from those of general-purpose deployments:
- Scheduling logic beyond selecting the first available node.
- Fairness, queueing, and resource awareness.
- Lifecycle awareness of jobs and pods.
The default AKS scheduler satisfies the requirements of Kubernetes services but provides limited configuration for batch workloads that require a job queueing system.
What is Kueue?
Kueue is an open-source Kubernetes-native job queueing project designed to manage batch workloads and ensure efficient, fair, and policy-driven scheduling in Kubernetes clusters. Kueue integrates with the Kubernetes scheduling ecosystem to coordinate resource allocation, prioritization, and capacity control for batch jobs.
Kueue introduces a two-level queuing model:
- A `ClusterQueue` represents shared resource pools (such as CPU, memory, and GPU quotas).
- A `LocalQueue` represents a tenant-facing queue in a namespace (where users submit their batch jobs).
Workloads submitted to a LocalQueue are matched to a ClusterQueue to determine if they can be admitted.
Note
A LocalQueue is always needed for users to submit batch workloads. The LocalQueue tells Kueue which ClusterQueue to assign the job to, and the ClusterQueue determines whether sufficient resources are available for the job to be admitted and run.
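To make the two-level model concrete, the following sketch writes a minimal set of queue manifests to a local file. It assumes the `kueue.x-k8s.io/v1beta1` API served by the 0.13.x release referenced in this article. The names `default-flavor`, `cluster-queue`, and `user-queue`, the `default` namespace, and the quota values are illustrative assumptions, not values from this article.

```shell
# Write illustrative Kueue queue manifests to a local file.
# All names and quota values below are placeholders chosen for this sketch.
cat <<'EOF' > kueue-queues.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}   # empty selector: accept workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 8
            - name: memory
              nominalQuota: 32Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: default
spec:
  clusterQueue: cluster-queue   # the ClusterQueue that backs this LocalQueue
EOF
```

After Kueue is installed, a file like this could be applied with `kubectl apply -f kueue-queues.yaml`. The LocalQueue points at the ClusterQueue by name, which is how a namespace-scoped queue maps to shared cluster capacity.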
Who can use Kueue?
Batch workload administrators (including platform or cluster administrators and DevOps engineers) and batch users (data scientists, developers, and ML engineers) can benefit from deploying workloads with Kueue on AKS.
A batch admin focuses on configuring, managing, and securing the platform-level infrastructure to support batch workloads, and has the following responsibilities:
- Provision and manage AKS node pools.
- Define resource quotas, ClusterQueues, and policies for workload isolation.
- Tune autoscaling and cost-efficiency (such as the Cluster Autoscaler or Kueue quotas).
- Monitor cluster and queue health.
- Create and maintain templates and reusable workflows.
A batch user runs compute-intensive or parallel jobs using the platform-level infrastructure configured by a batch admin, and typically:
- Submits batch jobs (such as Job, Workload, or custom controller CRDs) and monitors job status and outputs.
- Selects the appropriate queue or resource flavor for jobs (based on guidance from batch admins).
- Optimizes job specs for resource and performance needs.
| Queue Type | Scope | Created By | Used For |
|---|---|---|---|
| ClusterQueue | Cluster-wide | Platform admin | Define shared compute capacity and quota management |
| LocalQueue | Namespace | Namespace owner | Enable workload submission, mapped to ClusterQueue |
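As a sketch of how a batch user targets a queue, the following block writes a Kubernetes Job manifest that carries the `kueue.x-k8s.io/queue-name` label. The LocalQueue name `user-queue`, the `default` namespace, the `busybox:1.36` image, and the resource requests are hypothetical placeholders; substitute the values your batch admin provides.

```shell
# Write an illustrative Job manifest that is submitted through a LocalQueue.
# The queue name, namespace, image, and requests are placeholders for this sketch.
cat <<'EOF' > sample-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # LocalQueue that receives this Job
spec:
  parallelism: 3
  completions: 3
  suspend: true   # created suspended; Kueue unsuspends the Job once it is admitted
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c", "echo processing; sleep 30"]
          resources:
            requests:
              cpu: "1"
              memory: 200Mi
EOF
```

With Kueue installed and the queue in place, the Job could be submitted with `kubectl apply -f sample-job.yaml`. Setting resource requests matters here because Kueue compares them against the ClusterQueue quota when it decides whether to admit the workload.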
Prerequisites
- An existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
- Azure CLI installed on your local machine. To install or upgrade, see Install the Azure CLI.
- Helm version 3 or above installed.
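Before you continue, you might want to confirm that the required command-line tools are on your `PATH`. The following is a minimal sketch; the `check_tool` helper is a hypothetical name introduced for illustration, and `kubectl` is included as an assumption because later steps in this article use it.

```shell
# Report whether a CLI is available on PATH.
# Returns non-zero when the tool is missing so callers can react to it.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check each tool used in this article; continue even if one is missing
# so that every result is reported.
for tool in az helm kubectl; do
  check_tool "$tool" || true
done
```

To confirm the installed versions as well, `az version`, `helm version --short`, and `kubectl version --client` each print version information for their tool.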
Install Kueue with Helm
While most features and scheduling policies that you might require are enabled by default, some, such as TopologyAwareScheduling, aren't. If needed, reconfigure your Kueue installation by changing the default Feature Gates or by configuring Kueue parameter values in the values.yaml file of the Helm chart.
Kueue supports multiple workload frameworks, such as MPI Operator MPIJobs and KubeRay RayJobs. You need to explicitly enable each framework to use Kueue's scheduling and resource management capabilities with it.
In this guide, Kueue is configured with frameworks from Kubeflow, Ray, and JobSet and with the following feature gates:
- `LocalQueueMetrics` provides detailed Prometheus metrics specific to the state and activity of LocalQueues, enabling fine-grained monitoring of workload admission, quota reservation, and resource utilization.
- `TopologyAwareScheduling` allows scheduling of pods based on the topology of nodes in a pool or cluster to improve available bandwidth between the pods.
Note
Update version as needed: kueue/releases
1. Create and save a `values.yaml` file to optionally customize your Kueue configuration.

   ```bash
   cat <<EOF > values.yaml
   controllerManager:
     featureGates:
       - name: TopologyAwareScheduling
         enabled: true
       - name: LocalQueueMetrics
         enabled: true
   managerConfig:
     controllerManagerConfigYaml: |
       apiVersion: config.kueue.x-k8s.io/v1beta1
       kind: Configuration
       integrations:
         frameworks:
           - batch/job
           - kubeflow.org/mpijob
           - ray.io/rayjob
           - ray.io/raycluster
           - jobset.x-k8s.io/jobset
           - kubeflow.org/paddlejob
           - kubeflow.org/pytorchjob
           - kubeflow.org/tfjob
           - kubeflow.org/xgboostjob
           - kubeflow.org/jaxjob
   EOF
   ```

2. Install the latest version of the Kueue controller and CRDs in a dedicated namespace using the `helm install` command.

   ```bash
   LATEST_VERSION=$(curl -s https://api.github.com/repos/kubernetes-sigs/kueue/releases/latest | grep tag_name | cut -d '"' -f 4 | sed 's/^v//')
   helm install kueue oci://registry.k8s.io/kueue/charts/kueue \
     --version=${LATEST_VERSION} \
     --create-namespace --namespace=kueue-system \
     --values values.yaml
   ```

3. Confirm the deployment status using the `helm list` command.

   ```bash
   helm list --namespace kueue-system
   ```

   Your output should include a `Status` of `deployed` and look like:

   ```output
   Pulled: registry.k8s.io/kueue/charts/kueue:0.13.4
   Digest: -
   NAME: kueue
   LAST DEPLOYED: -
   NAMESPACE: kueue-system
   STATUS: deployed
   REVISION: 1
   TEST SUITE: None
   ```
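The version lookup in the install step strips the leading `v` from the release tag, because the chart version omits it. The following sketch runs the same `grep`, `cut`, and `sed` pipeline offline against a sample line, assuming the GitHub API response is pretty-printed with one field per line. The tag `v0.13.4` matches the example output in this article; the `sample` and `parsed` variable names are introduced here for illustration.

```shell
# One line of the releases/latest response, as assumed for this sketch.
sample='  "tag_name": "v0.13.4",'

# Same pipeline as the install step:
#   grep keeps the tag_name line,
#   cut takes the 4th double-quote-delimited field (the tag value),
#   sed removes the leading "v".
parsed=$(printf '%s\n' "$sample" | grep tag_name | cut -d '"' -f 4 | sed 's/^v//')

echo "$parsed"   # prints 0.13.4
```

If you prefer a reproducible install over tracking the latest release, one option is to set `LATEST_VERSION` to a fixed value such as `0.13.4` instead of querying the API.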
Confirm deployment status
1. Verify that controller pods are running properly.

   ```bash
   kubectl get deploy -n kueue-system
   ```

   Your output should look similar to the following example output:

   ```output
   NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
   kueue-controller-manager   1/1     1            1           7s
   ```

2. Confirm the installation of Kueue resources on your AKS cluster:

   ```bash
   kubectl get crds | grep kueue
   ```

   Your output should include the following Kueue CRDs:

   ```output
   admissionchecks.kueue.x-k8s.io               2025-09-11T18:20:48Z
   clusterqueues.kueue.x-k8s.io                 2025-09-11T18:20:48Z
   cohorts.kueue.x-k8s.io                       2025-09-11T18:20:48Z
   localqueues.kueue.x-k8s.io                   2025-09-11T18:20:48Z
   multikueueclusters.kueue.x-k8s.io            2025-09-11T18:20:48Z
   multikueueconfigs.kueue.x-k8s.io             2025-09-11T18:20:48Z
   provisioningrequestconfigs.kueue.x-k8s.io    2025-09-11T18:20:48Z
   resourceflavors.kueue.x-k8s.io               2025-09-11T18:20:48Z
   topologies.kueue.x-k8s.io                    2025-09-11T18:20:48Z
   workloadpriorityclasses.kueue.x-k8s.io       2025-09-11T18:20:48Z
   workloads.kueue.x-k8s.io                     2025-09-11T18:20:48Z
   ```
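Comparing the CRD list by eye is error-prone, so the check can be scripted. The following sketch defines a hypothetical helper, `missing_kueue_crds`, that reads installed CRD names on standard input and prints any expected Kueue CRD that is absent. The expected list is copied from the example output above and might differ in other Kueue releases.

```shell
# Expected Kueue CRD prefixes, copied from the example output in this article.
expected_crds="admissionchecks clusterqueues cohorts localqueues multikueueclusters multikueueconfigs provisioningrequestconfigs resourceflavors topologies workloadpriorityclasses workloads"

# Read CRD names (one per line) on stdin and print each expected
# Kueue CRD that does not appear in the input. Prints nothing when
# every expected CRD is present.
missing_kueue_crds() {
  installed=$(cat)
  for crd in $expected_crds; do
    # -x matches the whole line, -F treats the name as a fixed string.
    if ! printf '%s\n' "$installed" | grep -qxF "${crd}.kueue.x-k8s.io"; then
      echo "${crd}.kueue.x-k8s.io"
    fi
  done
}

# Usage against a live cluster (requires kubectl access):
# kubectl get crds -o custom-columns=NAME:.metadata.name --no-headers | missing_kueue_crds
```

Empty output means that all expected CRDs are installed. To wait for the controller itself, `kubectl rollout status deployment/kueue-controller-manager -n kueue-system` blocks until the deployment is available.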
Uninstall Kueue
If you no longer need to use the Kueue controller manager or Kueue custom resources in your AKS cluster, you can uninstall the Helm release and remove the dedicated namespace and resources.

1. Uninstall the Kueue Helm release using the `helm uninstall` command.

   ```bash
   helm uninstall kueue --namespace kueue-system
   ```

2. Remove the dedicated namespace and resources using the `kubectl delete` command.

   ```bash
   kubectl delete namespace kueue-system
   ```