Install and Configure Kueue on Azure Kubernetes Service (AKS)

In this article, you learn how to install and configure Kueue to schedule batch workloads on an Azure Kubernetes Service (AKS) cluster. You also explore Kueue concepts, review installation options that enable advanced Kueue features, and learn how to verify your deployment.

Important

Open-source software is mentioned throughout AKS documentation and samples. Software that you deploy is excluded from AKS service-level agreements, limited warranty, and Azure support. As you use open-source technology alongside AKS, consult the support options available from the respective communities and project maintainers to develop a plan.

For example, the Ray GitHub repository describes several platforms that vary in response time, purpose, and support level.

Microsoft takes responsibility for building the open-source packages that we deploy on AKS. That responsibility includes having complete ownership of the build, scan, sign, validate, and hotfix process, along with control over the binaries in container images. For more information, see Vulnerability management for AKS and AKS support coverage.

What are batch workloads?

Batch deployments are typically non-interactive workloads that are retriable, have a finite duration, and might experience spiky or bursty resource usage. These workloads include, but aren't limited to:

  • Data processing jobs.
  • Security vulnerability scans.
  • Media encoding or video transcoding.
  • Report generation or financial analysis.
  • GPU workloads that require all resources to be available and might tolerate a delayed start but can't tolerate partial GPU allocation.

These workloads are often modeled using a Kubernetes Job, CronJob, or custom resource definition (CRD) like RayJob or Kubeflow MPIJob. Compared with general-purpose deployments, batch deployments have the following distinct requirements:

  • Scheduling logic beyond selecting the first available node.
  • Fairness, queueing, and resource awareness.
  • Lifecycle awareness of jobs and pods.

The default AKS scheduler satisfies the requirements of Kubernetes services but provides limited configuration for batch workloads that require a job queueing system.

What is Kueue?

Kueue is an open-source Kubernetes-native job queueing project designed to manage batch workloads and ensure efficient, fair, and policy-driven scheduling in Kubernetes clusters. Kueue integrates with the Kubernetes scheduling ecosystem to coordinate resource allocation, prioritization, and capacity control for batch jobs.

Kueue introduces a two-level queuing model:

  • A ClusterQueue represents shared resource pools (such as CPU, memory, GPU quotas).
  • A LocalQueue represents a tenant-facing queue in a namespace (where users submit their batch jobs).

Workloads submitted to a LocalQueue are matched to a ClusterQueue to determine if they can be admitted.

Note

Users always submit batch workloads through a LocalQueue, and the LocalQueue tells Kueue which ClusterQueue to assign the job to. The ClusterQueue then determines whether sufficient resources are available for the job to be admitted and run.
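
The relationship between the two queue levels can be sketched with a minimal set of manifests. The following example writes a ResourceFlavor, a ClusterQueue, and a LocalQueue to a local file. The names (default-flavor, cluster-queue, user-queue), the batch-team namespace, and the quota values are hypothetical, and the sketch assumes that your installed Kueue version serves the kueue.x-k8s.io/v1beta1 API.

```shell
# Write illustrative Kueue queue manifests to a local file.
# All names, the namespace, and the quota values are assumptions for this sketch.
cat <<'EOF' > kueue-queues.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}   # accept LocalQueues from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 8
            - name: memory
              nominalQuota: 32Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: batch-team
spec:
  clusterQueue: cluster-queue   # the ClusterQueue that decides admission
EOF
echo "Wrote kueue-queues.yaml"
```

After Kueue is installed and the namespace exists, you could apply the file with kubectl apply -f kueue-queues.yaml. The LocalQueue only points at the ClusterQueue, and the quota itself lives in the ClusterQueue.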

Who can use Kueue?

Batch workload administrators (including platform or cluster administrators and DevOps engineers) and batch users (data scientists, developers, and ML engineers) can benefit from deploying workloads with Kueue on AKS.

A batch admin focuses on configuring, managing, and securing the platform-level infrastructure to support batch workloads, and has the following responsibilities:

  • Provision and manage AKS node pools.
  • Define resource quotas, ClusterQueues, and policies for workload isolation.
  • Tune autoscaling and cost-efficiency (such as the Cluster Autoscaler or Kueue quotas).
  • Monitor cluster and queue health.
  • Create and maintain templates and reusable workflows.

A batch user runs compute-intensive or parallel jobs on the platform-level infrastructure configured by a batch admin, and typically does the following:

  • Submits batch jobs (such as Job, Workload, or custom controller CRDs) and monitors job status and outputs.
  • Selects the appropriate queue or resource flavor for jobs (based on guidance from batch admins).
  • Optimizes job specs for resource and performance needs.

| Queue type | Scope | Created by | Used for |
|---|---|---|---|
| ClusterQueue | Cluster-wide | Platform admin | Define shared compute capacity and quota management |
| LocalQueue | Namespace | Namespace owner | Enable workload submission, mapped to ClusterQueue |
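
From the batch user's side, submitting work to a queue can be sketched as follows. This example writes a Kubernetes Job manifest that targets a LocalQueue through the kueue.x-k8s.io/queue-name label. The queue name (user-queue), the namespace (batch-team), the job name, the container image, and the resource requests are hypothetical values chosen for illustration.

```shell
# Write an illustrative Job manifest that is submitted through a Kueue LocalQueue.
# The queue name, namespace, image, and resource requests are assumptions.
cat <<'EOF' > sample-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
  namespace: batch-team
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # LocalQueue that receives the job
spec:
  parallelism: 2
  completions: 2
  suspend: true   # created suspended; Kueue unsuspends the job on admission
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c", "echo processing && sleep 30"]
          resources:
            requests:
              cpu: "1"
              memory: 200Mi
EOF
echo "Wrote sample-job.yaml"
```

After you apply the manifest with kubectl apply -f sample-job.yaml, running kubectl get workloads -n batch-team is one way to check whether Kueue admitted the job.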

Prerequisites

Install Kueue with Helm

While most features and scheduling policies that you might require are enabled by default, some, such as TopologyAwareScheduling, aren't. If needed, reconfigure your Kueue installation by changing the default feature gates or by setting Kueue parameter values in the values.yaml file of the Helm chart.

Kueue supports multiple workload frameworks, which you must explicitly enable before Kueue's scheduling and resource management capabilities apply to them. Examples include MPI Operator MPIJobs and KubeRay RayJobs.

In this guide, Kueue is configured with the LocalQueueMetrics and TopologyAwareScheduling feature gates and with framework integrations for Kubeflow, Ray, and JobSet.

  • LocalQueueMetrics provides detailed Prometheus metrics specific to the state and activity of LocalQueues, enabling fine-grained monitoring of workload admission, quota reservation, and resource utilization.
  • TopologyAwareScheduling allows scheduling of pods based on the topology of nodes in a pool or cluster to improve available bandwidth between the pods.
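
As a rough illustration of how TopologyAwareScheduling is wired up once the feature gate is enabled, the following sketch writes a Topology resource and a ResourceFlavor that references it. Treat the details as assumptions: the Topology API version (v1alpha1 here) depends on your Kueue release, the resource names are hypothetical, and batchpool is a placeholder AKS node pool name. You can check the served version with kubectl api-resources.

```shell
# Write an illustrative Topology and a ResourceFlavor that uses it.
# Assumptions: the Topology apiVersion, the resource names, and the node pool name.
cat <<'EOF' > kueue-topology.yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: default-topology
spec:
  levels:                                   # ordered from widest to narrowest domain
    - nodeLabel: topology.kubernetes.io/zone
    - nodeLabel: kubernetes.io/hostname
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tas-flavor
spec:
  nodeLabels:
    kubernetes.azure.com/agentpool: batchpool   # hypothetical node pool
  topologyName: default-topology
EOF
echo "Wrote kueue-topology.yaml"
```

A ClusterQueue that lists tas-flavor in its resource groups would then let Kueue take the node topology into account when it admits workloads.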

Note

Update the version as needed. For available versions, see the Kueue releases page (kueue/releases).

  1. Create and save a values.yaml file to optionally customize your Kueue configuration.

    cat <<EOF > values.yaml
    controllerManager:
      featureGates:
        - name: TopologyAwareScheduling
          enabled: true
        - name: LocalQueueMetrics
          enabled: true
      managerConfig:
        controllerManagerConfigYaml: |
          apiVersion: config.kueue.x-k8s.io/v1beta1
          kind: Configuration
          integrations:
            frameworks:
              - batch/job
              - kubeflow.org/mpijob
              - ray.io/rayjob
              - ray.io/raycluster
              - jobset.x-k8s.io/jobset
              - kubeflow.org/paddlejob
              - kubeflow.org/pytorchjob
              - kubeflow.org/tfjob
              - kubeflow.org/xgboostjob
              - kubeflow.org/jaxjob
    EOF
    
  2. Install the latest version of the Kueue controller and CRDs in a dedicated namespace using the helm install command.

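    # Look up the latest Kueue release tag and strip the leading "v" to match the chart version.
    LATEST_VERSION=$(curl -s https://api.github.com/repos/kubernetes-sigs/kueue/releases/latest | grep tag_name | cut -d '"' -f 4 | sed 's/^v//')
    
    # Install the Kueue chart into a dedicated namespace with the custom values.
    helm install kueue oci://registry.k8s.io/kueue/charts/kueue \
      --version="${LATEST_VERSION}" \
      --create-namespace --namespace=kueue-system \
      --values values.yaml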
    
  3. Confirm the deployment status using the helm list command.

    helm list --namespace kueue-system
    

    The helm install command in the previous step prints output similar to the following example. In both outputs, confirm that the STATUS is deployed:

    Pulled: registry.k8s.io/kueue/charts/kueue:0.13.4
    Digest: -
    NAME: kueue
    LAST DEPLOYED: -
    NAMESPACE: kueue-system
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
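
Resolving the latest release at install time means that two installs on different days can end up with different Kueue versions. For repeatable installs, one option is to pin the chart version explicitly. The following is a minimal sketch: chart_version_from_tag is a hypothetical helper, and the tag v0.13.4 is taken from the sample output above only as an example.

```shell
# Hypothetical helper: strip a leading "v" from a release tag so that it
# matches the chart version format (for example, v0.13.4 becomes 0.13.4).
chart_version_from_tag() {
  printf '%s\n' "${1#v}"
}

# Pin the version instead of querying the latest release.
KUEUE_VERSION="$(chart_version_from_tag "v0.13.4")"
echo "Pinned Kueue chart version: ${KUEUE_VERSION}"
```

You could then pass --version="${KUEUE_VERSION}" to the helm install command in place of the dynamically resolved value.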
    

Confirm deployment status

  1. Verify that the Kueue controller manager deployment is available using the kubectl get command.

    kubectl get deploy -n kueue-system
    

    Your output should look similar to the following example output:

    NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
    kueue-controller-manager       1/1     1            1           7s
    
  2. Confirm the installation of Kueue resources on your AKS cluster:

    kubectl get crds | grep kueue
    

    Your output should include the following Kueue CRDs:

    admissionchecks.kueue.x-k8s.io                   2025-09-11T18:20:48Z
    clusterqueues.kueue.x-k8s.io                     2025-09-11T18:20:48Z
    cohorts.kueue.x-k8s.io                           2025-09-11T18:20:48Z
    localqueues.kueue.x-k8s.io                       2025-09-11T18:20:48Z
    multikueueclusters.kueue.x-k8s.io                2025-09-11T18:20:48Z
    multikueueconfigs.kueue.x-k8s.io                 2025-09-11T18:20:48Z
    provisioningrequestconfigs.kueue.x-k8s.io        2025-09-11T18:20:48Z
    resourceflavors.kueue.x-k8s.io                   2025-09-11T18:20:48Z
    topologies.kueue.x-k8s.io                        2025-09-11T18:20:48Z
    workloadpriorityclasses.kueue.x-k8s.io           2025-09-11T18:20:48Z
    workloads.kueue.x-k8s.io                         2025-09-11T18:20:48Z
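
If you want to script these checks, for example in a deployment pipeline, the two steps above can be sketched as a single shell function. The function name verify_kueue and the timeout values are assumptions. The function waits for the controller manager rollout to finish and for one of the core CRDs to be established.

```shell
# Hypothetical helper: block until the Kueue controller and a core CRD are ready.
# The first argument is the namespace (defaults to kueue-system).
verify_kueue() {
  ns="${1:-kueue-system}"

  # Wait for the controller manager deployment to finish rolling out.
  kubectl rollout status deployment/kueue-controller-manager \
    --namespace "$ns" --timeout=120s || return 1

  # Wait for the ClusterQueue CRD to be registered with the API server.
  kubectl wait --for=condition=Established \
    crd/clusterqueues.kueue.x-k8s.io --timeout=60s || return 1

  echo "Kueue is ready in namespace ${ns}"
}
```

Call verify_kueue after the helm install command completes. A nonzero return code indicates that one of the checks failed or timed out.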
    

Uninstall Kueue

If you no longer need to use the Kueue controller manager or Kueue custom resources in your AKS cluster, you can uninstall the Helm release and remove the dedicated namespace and resources.

  1. Uninstall the Kueue Helm release using the helm uninstall command.

    helm uninstall kueue --namespace kueue-system  
    
  2. Remove the dedicated namespace and resources using the kubectl delete command.

    kubectl delete namespace kueue-system  
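
To confirm that the cleanup is complete, you might also check whether any Kueue CRDs remain in the cluster. The following sketch defines a hypothetical helper, list_kueue_crds, that filters the cluster's CRD names for the kueue.x-k8s.io group. Whether any CRDs remain after uninstalling depends on how the chart manages them, so treat this as a check rather than an expected result.

```shell
# Hypothetical helper: print the names of any Kueue CRDs still present.
# Prints nothing when no CRD in the kueue.x-k8s.io group is found.
list_kueue_crds() {
  kubectl get crds -o name | grep 'kueue.x-k8s.io' || true
}
```

Empty output from list_kueue_crds means that no Kueue CRDs are left in the cluster.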
    

Next steps