Configure advanced scheduler profiles on Azure Kubernetes Service (AKS) (preview)

In this article, you learn how to deploy example scheduler profiles in Azure Kubernetes Service (AKS) to configure advanced scheduling behavior using in-tree scheduling plugins. This guide also explains how to verify the successful application of custom scheduler profiles targeting specific node pools or the entire AKS cluster.

Limitations

  • AKS currently doesn't manage the deployment of third-party schedulers or out-of-tree scheduling plugins.
  • AKS doesn't support in-tree scheduling plugins targeting the aks-system scheduler. This restriction is in place to help prevent unexpected changes to AKS add-ons enabled on your cluster.

Prerequisites

Install the aks-preview Azure CLI extension

Important

AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:

  1. Install the aks-preview extension using the az extension add command.

    az extension add --name aks-preview
    
  2. Update to the latest version of the aks-preview extension using the az extension update command.

    az extension update --name aks-preview
    

Register the User Defined Scheduler Configuration Preview feature flag

  1. Register the UserDefinedSchedulerConfigurationPreview feature flag using the az feature register command.

    az feature register --namespace "Microsoft.ContainerService" --name "UserDefinedSchedulerConfigurationPreview"
    

    It takes a few minutes for the status to show Registered.

  2. Verify the registration status using the az feature show command.

    az feature show --namespace "Microsoft.ContainerService" --name "UserDefinedSchedulerConfigurationPreview"
    
  3. When the status reflects Registered, refresh the registration of the Microsoft.ContainerService resource provider using the az provider register command.

    az provider register --namespace "Microsoft.ContainerService"
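Because registration can take a few minutes, you might prefer to poll the state rather than rerun the command by hand. The following Python sketch is one way to do that, assuming the Azure CLI is on your PATH and that the JSON output of az feature show exposes the state under properties.state. The function names, attempt count, and delay are illustrative choices, not part of any AKS tooling.

```python
import json
import subprocess
import time

NAMESPACE = "Microsoft.ContainerService"
FEATURE = "UserDefinedSchedulerConfigurationPreview"


def fetch_feature_json():
    """Run `az feature show` and return its JSON output as text."""
    result = subprocess.run(
        ["az", "feature", "show", "--namespace", NAMESPACE,
         "--name", FEATURE, "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def feature_state(raw_json):
    """Extract properties.state from the `az feature show` output."""
    return json.loads(raw_json)["properties"]["state"]


def wait_until_registered(fetch=fetch_feature_json, attempts=30, delay_seconds=20):
    """Poll until the state is Registered; return False if attempts run out."""
    for _ in range(attempts):
        if feature_state(fetch()) == "Registered":
            return True
        time.sleep(delay_seconds)
    return False
```

Calling wait_until_registered() with no arguments polls the real CLI. Passing a different fetch callable lets you try the logic without an Azure subscription.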
    

Enable scheduler profile configuration on an AKS cluster

You can enable scheduler profile configuration on a new or existing AKS cluster.

  1. Create an AKS cluster with scheduler profile configuration enabled using the az aks create command with the --enable-upstream-kubescheduler-user-configuration flag.

    # Set environment variables
    export RESOURCE_GROUP=<resource-group-name>
    export CLUSTER_NAME=<aks-cluster-name>
    
    # Create an AKS cluster with scheduler profile configuration enabled
    az aks create \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --enable-upstream-kubescheduler-user-configuration \
    --generate-ssh-keys
    
  2. Once the creation process completes, connect to the cluster using the az aks get-credentials command.

    az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME
    

Verify installation of the scheduler controller

  • After enabling the feature on your AKS cluster, verify the custom resource definition (CRD) of the scheduler controller was successfully installed using the kubectl get command.

    kubectl get crd schedulerconfigurations.aks.azure.com
    

    Note

    This command won't succeed if the feature wasn't successfully enabled in the previous section.

Configure node bin-packing

Node bin-packing is a scheduling strategy that maximizes resource utilization by increasing pod density on nodes, within the set configuration. This strategy helps improve cluster efficiency by minimizing wasted resources and lowering the operational cost of maintaining idle or underutilized nodes.

In this example, the configured scheduler prioritizes scheduling pods on nodes where a high share of CPU is already allocated. Specifically, this configuration avoids leaving nodes underutilized while they still have free resources and helps make better use of the resources already allocated to nodes.

  1. Create a file named aks-scheduler-customization.yaml and paste in the following manifest:

    apiVersion: aks.azure.com/v1alpha1
    kind: SchedulerConfiguration
    metadata:
      name: upstream
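    spec:
      rawConfig: |
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        profiles:
        - schedulerName: node-binpacking-scheduler
          pluginConfig:
          - name: NodeResourcesFit
            args:
              apiVersion: kubescheduler.config.k8s.io/v1
              kind: NodeResourcesFitArgs
              scoringStrategy:
                type: MostAllocated
                resources:
                - name: cpu
                  weight: 1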
    
    • NodeResourcesFit ensures that the scheduler checks if a node has enough resources to run the pod.
    • scoringStrategy: MostAllocated tells the scheduler to prefer nodes where a larger share of CPU is already requested by scheduled pods. This helps achieve better resource utilization by placing new pods on nodes that are already heavily allocated.
    • resources lists the resources considered for scoring. Here, CPU is the only resource listed, so it alone determines the score. The weight of 1 only matters relative to the weights of other listed resources.
  2. Apply the scheduling configuration manifest using the kubectl apply command.

    kubectl apply -f aks-scheduler-customization.yaml
    
  3. To target this scheduling mechanism for specific workloads, update your pod deployments with the following schedulerName:

    ...
    ...
        spec:
          schedulerName: node-binpacking-scheduler
    ...
    ...
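
To make the effect of MostAllocated concrete, the following Python sketch scores two hypothetical nodes for a pod that requests 500 millicores of CPU. It's a simplified illustration of the scoring idea (requested CPU, including the incoming pod, as a percentage of allocatable CPU), not the scheduler's actual implementation. The node names and numbers are made up for the example.

```python
def cpu_most_allocated_score(requested_millicores, allocatable_millicores):
    """Return a 0-100 score: higher means more of the node's CPU is requested."""
    if allocatable_millicores == 0 or requested_millicores > allocatable_millicores:
        return 0.0  # a node that can't fit the pod gets no score in this sketch
    return 100 * requested_millicores / allocatable_millicores


# Hypothetical nodes: CPU already requested by running pods, and allocatable CPU.
nodes = {
    "node-a": {"requested": 3000, "allocatable": 4000},
    "node-b": {"requested": 1000, "allocatable": 4000},
}
pod_request = 500  # millicores requested by the incoming pod

scores = {
    name: cpu_most_allocated_score(n["requested"] + pod_request, n["allocatable"])
    for name, n in nodes.items()
}
best_node = max(scores, key=scores.get)
print(scores, best_node)
```

In this sketch the busier node-a scores higher than node-b, so a bin-packing profile would prefer it as long as the pod still fits.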
    

Configure pod topology spread

Pod topology spread is a scheduling strategy that seeks to distribute pods evenly across failure domains (such as availability zones or regions) to ensure high availability and fault tolerance in the event of zone or node failures. This strategy reduces the risk of all replicas of a pod being placed in the same failure domain. For more configuration guidance, see the [Kubernetes Pod Topology Spread Constraints documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/).

  1. Create a file named aks-scheduler-customization.yaml and paste in the following manifest:

    apiVersion: aks.azure.com/v1alpha1
    kind: SchedulerConfiguration
    metadata:
      name: upstream
    spec:
      rawConfig: |
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        profiles:
        - schedulerName: pod-distribution-scheduler
          pluginConfig:
              - name: PodTopologySpread
                args:
                  apiVersion: kubescheduler.config.k8s.io/v1
                  kind: PodTopologySpreadArgs
                  defaultingType: List
                  defaultConstraints:
                    - maxSkew: 1
                      topologyKey: topology.kubernetes.io/zone
                      whenUnsatisfiable: ScheduleAnyway
    
    • The PodTopologySpread plugin instructs the scheduler to try to distribute pods as evenly as possible across availability zones.
    • whenUnsatisfiable: ScheduleAnyway tells the scheduler to schedule pods even when the topology constraints can't be met. This avoids pod scheduling failures when exact distribution isn't feasible.
    • defaultingType: List tells the scheduler to use the rules listed under defaultConstraints as the default constraints. They apply to all pods that don't specify custom topology spread constraints.
    • maxSkew: 1 means the number of pods can differ by at most 1 between any two zones.
    • topologyKey: topology.kubernetes.io/zone indicates that the scheduler should spread pods across availability zones.
  2. Apply the scheduling configuration manifest using the kubectl apply command.

    kubectl apply -f aks-scheduler-customization.yaml
    
  3. To target this scheduling mechanism for specific workloads, update your pod deployments with the following schedulerName:

    ...
    ...
        spec:
          schedulerName: pod-distribution-scheduler
    ...
    ...
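
The maxSkew rule described above can be sketched in a few lines of Python. This is a simplified model: it assumes skew is the number of matching pods in the candidate zone after placement minus the lowest count in any zone, and it ignores label selectors and node eligibility. The zone names and pod counts are hypothetical.

```python
MAX_SKEW = 1  # matches maxSkew in the manifest above


def skew_after_placing(pods_per_zone, zone):
    """Skew if one more pod lands in `zone`, relative to the emptiest zone."""
    return pods_per_zone[zone] + 1 - min(pods_per_zone.values())


# Hypothetical current distribution of a workload's pods.
pods_per_zone = {"zone-1": 2, "zone-2": 1, "zone-3": 1}

skews = {zone: skew_after_placing(pods_per_zone, zone) for zone in pods_per_zone}
within_limit = [zone for zone, skew in skews.items() if skew <= MAX_SKEW]
print(skews, within_limit)
```

With ScheduleAnyway, a zone that exceeds the limit (zone-1 here) is only deprioritized. With DoNotSchedule, it would be excluded.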
    

Assign a scheduler profile to an entire AKS cluster

  1. In your scheduler profile configuration, update the schedulerName field as follows:

    ...
    ...
    - schedulerName: default-scheduler
    ...
    ...
    
  2. Reapply the manifest using the kubectl apply command.

    kubectl apply -f aks-scheduler-customization.yaml
    

    This configuration now becomes the default scheduling behavior for your entire AKS cluster.

Configure multiple scheduler profiles

You can customize the upstream scheduler with multiple profiles and customize each profile with multiple plugins while using the same configuration file. In the following example, we create two scheduling profiles called scheduler-one and scheduler-two:

  • scheduler-one prioritizes placing pods across zones and nodes for balanced distribution with the following settings:

    • Enforces strict zonal distribution and preferred node distribution using PodTopologySpread.
    • Honors hard pod affinity rules and considers the soft affinity rules with InterPodAffinity.
    • Prefers nodes in specific zones to reduce cross-zone networking using NodeAffinity.
  • scheduler-two prioritizes placing pods on nodes with available storage, CPU, and memory resources for timely, efficient resource usage with the following settings:

    • Ensures pods are placed on nodes where PVCs can bind to PVs using VolumeBinding.
    • Validates that nodes and volumes satisfy zonal requirements using VolumeZone to avoid cross-zone storage access.
    • Prioritizes nodes based on CPU, memory, and ephemeral storage utilization, with NodeResourcesFit.
    • Favors nodes that already have the required container images using ImageLocality.

Note

You might need to adjust zones and other parameters based on your workload type.

  1. Create a file named aks-scheduler-customization.yaml and paste in the following manifest:

    apiVersion: aks.azure.com/v1alpha1
    kind: SchedulerConfiguration
    metadata:
      name: upstream
    spec:
      rawConfig: |
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        percentageOfNodesToScore: 40
        podInitialBackoffSeconds: 1
        podMaxBackoffSeconds: 8
        profiles:
          - schedulerName: scheduler-one
            plugins:
              multiPoint:
                enabled:
                  - name: PodTopologySpread
                  - name: InterPodAffinity
                  - name: NodeAffinity
            pluginConfig:
              # PodTopologySpread with strict zonal distribution        
              - name: PodTopologySpread
                args:
                  defaultingType: List
                  defaultConstraints:
                    - maxSkew: 2
                      topologyKey: topology.kubernetes.io/zone
                      whenUnsatisfiable: DoNotSchedule
                    - maxSkew: 1
                      topologyKey: kubernetes.io/hostname
                      whenUnsatisfiable: ScheduleAnyway                  
              - name: InterPodAffinity
                args:
                  hardPodAffinityWeight: 1
                  ignorePreferredTermsOfExistingPods: false
              - name: NodeAffinity
                args:
                  addedAffinity:
                    preferredDuringSchedulingIgnoredDuringExecution:
                      - weight: 100
                        preference:
                          matchExpressions:
                            - key: topology.kubernetes.io/zone
                              operator: In
                              values: [chinanorth3-1, chinanorth3-2, chinanorth3-3]
          - schedulerName: scheduler-two
            plugins:
              multiPoint:
                enabled:
                  - name: VolumeBinding
                  - name: VolumeZone
                  - name: NodeAffinity
                  - name: NodeResourcesFit
                  - name: PodTopologySpread
                  - name: ImageLocality
            pluginConfig:
              - name: PodTopologySpread
                args:
                  defaultingType: List
                  defaultConstraints:
                    - maxSkew: 1
                      topologyKey: kubernetes.io/hostname
                      whenUnsatisfiable: DoNotSchedule 
              - name: VolumeBinding
                args:
                  apiVersion: kubescheduler.config.k8s.io/v1
                  kind: VolumeBindingArgs
                  bindTimeoutSeconds: 300
              - name: NodeAffinity
                args:
                  apiVersion: kubescheduler.config.k8s.io/v1
                  kind: NodeAffinityArgs
                  addedAffinity:
                    preferredDuringSchedulingIgnoredDuringExecution:
                      - weight: 100
                        preference:
                          matchExpressions:
                            - key: topology.kubernetes.io/zone
                              operator: In
                              values: [chinanorth3-1, chinanorth3-2]
              - name: NodeResourcesFit
                args:
                  apiVersion: kubescheduler.config.k8s.io/v1
                  kind: NodeResourcesFitArgs
                  scoringStrategy:
                    type: MostAllocated
                    resources:
                      - name: cpu
                        weight: 3
                      - name: memory
                        weight: 1
                      - name: ephemeral-storage
                        weight: 2
    
  2. Apply the manifest using the kubectl apply command.

    kubectl apply -f aks-scheduler-customization.yaml
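
To see how the NodeResourcesFit weights in scheduler-two (cpu 3, memory 1, ephemeral-storage 2) shape the outcome, the following Python sketch computes a weighted MostAllocated-style score for two hypothetical nodes. It assumes the score is the weighted average of each resource's requested percentage, which is a simplification of the scheduler's behavior. The node figures are invented for illustration.

```python
WEIGHTS = {"cpu": 3, "memory": 1, "ephemeral-storage": 2}  # from scheduler-two


def weighted_most_allocated_score(requested, allocatable, weights=WEIGHTS):
    """Weighted average of per-resource requested percentages (0-100)."""
    weighted_sum = 0.0
    for resource, weight in weights.items():
        percent_requested = 100 * requested[resource] / allocatable[resource]
        weighted_sum += weight * percent_requested
    return weighted_sum / sum(weights.values())


# Hypothetical nodes of the same size: CPU in millicores, memory and storage in GiB.
allocatable = {"cpu": 4000, "memory": 8, "ephemeral-storage": 100}
cpu_heavy = {"cpu": 3000, "memory": 2, "ephemeral-storage": 10}
memory_heavy = {"cpu": 2000, "memory": 6, "ephemeral-storage": 10}

cpu_heavy_score = weighted_most_allocated_score(cpu_heavy, allocatable)
memory_heavy_score = weighted_most_allocated_score(memory_heavy, allocatable)
print(cpu_heavy_score, round(memory_heavy_score, 2))
```

Because CPU carries three times the weight of memory, the CPU-heavy node scores higher here even though the other node has more memory requested.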
    

Disable an AKS scheduler profile configuration

  1. To disable the AKS scheduler profile configuration and revert to the default AKS scheduler configuration on the cluster, first delete the schedulerconfiguration resource using the kubectl delete command.

    kubectl delete schedulerconfiguration upstream || true
    

    Note

    Ensure that the previous step is complete and confirm that the schedulerconfiguration resource was deleted before proceeding to disable this feature.

  2. Disable the feature using the az aks update command with the --disable-upstream-kubescheduler-user-configuration flag.

    az aks update --subscription="${SUBSCRIPTION_ID}" \
    --resource-group="${RESOURCE_GROUP}" \
    --name="${CLUSTER_NAME}" \
    --disable-upstream-kubescheduler-user-configuration
    
  3. Verify the feature is disabled using the az aks show command.

    az aks show --resource-group="${RESOURCE_GROUP}" \
    --name="${CLUSTER_NAME}" \
    --query='properties.schedulerProfile'
    

    Your output should indicate that the feature is no longer enabled on your AKS cluster.

Frequently asked questions (FAQ)

What happens if I apply a misconfigured scheduler profile to my AKS cluster?

Once you apply a scheduler profile, AKS checks if it contains a valid configuration of plugins and arguments. If the configuration targets a disallowed scheduler or sets the in-tree scheduling plugins improperly, AKS rejects the configuration and reverts to the last known "accepted" scheduler configuration. This check aims to limit impact on new and existing AKS clusters due to scheduler misconfiguration.

How can I monitor and validate that the scheduler honored my configuration?

There are three recommended methods for observing the results of your applied scheduler profile:

  • View the AKS kube-scheduler control plane logs to ensure that the scheduler received the configuration from the CRD.
  • Run the kubectl get schedulerconfiguration command. The output displays the status of the configuration: Pending during the rollout, and Succeeded or Failed after the scheduler accepts or rejects the configuration.
  • Run the kubectl describe schedulerconfiguration command. The output displays a more detailed state of the scheduler, including any error during the reconciliation, and the current scheduler configuration in effect.

Next steps

To learn more about the AKS scheduler and best practices, see the following resources: