Best practices for advanced scheduler features in Azure Kubernetes Service (AKS)

As you manage clusters in Azure Kubernetes Service (AKS), you often need to isolate teams and workloads. The Kubernetes scheduler provides advanced features that let you control which pods can be scheduled on certain nodes, or how multi-pod applications can be appropriately distributed across the cluster.

This best practices article focuses on advanced Kubernetes scheduling features for cluster operators. In this article, you learn how to:

  • Use taints and tolerations to limit what pods can be scheduled on nodes
  • Give preference to pods to run on certain nodes with node selectors or node affinity
  • Split apart or group together pods with inter-pod affinity or anti-affinity

Provide dedicated nodes using taints and tolerations

Best practice guidance - Limit access for resource-intensive applications, such as ingress controllers, to specific nodes. Keep node resources available for workloads that require them, and don't allow scheduling of other workloads on those nodes.

When you create your AKS cluster, you can deploy nodes with GPU support or a large number of powerful CPUs. These nodes are often used for large data processing workloads such as machine learning (ML) or artificial intelligence (AI). Because this type of hardware is typically an expensive node resource to deploy, limit the workloads that can be scheduled on these nodes. You may instead wish to dedicate some nodes in the cluster to run ingress services and prevent other workloads from running there.

This support for different nodes is provided by using multiple node pools. An AKS cluster provides one or more node pools.

The Kubernetes scheduler can use taints and tolerations to restrict what workloads can run on nodes.

  • A taint is applied to a node to indicate that only specific pods can be scheduled on it.
  • A toleration is then applied to a pod that allows it to tolerate a node's taint.

When you deploy a pod to an AKS cluster, Kubernetes only schedules pods on nodes whose taints are aligned with the pod's tolerations. As an example, assume you have a node pool in your AKS cluster for nodes with GPU support. You define a name, such as gpu, and then a value for scheduling. If you set this value to NoSchedule, the Kubernetes scheduler can't schedule pods on the node unless the pod defines the appropriate toleration.

kubectl taint node aks-nodepool1 sku=gpu:NoSchedule
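
To confirm that the taint was applied, you can describe the node and check its Taints field. This is a quick illustrative check; the node name matches the preceding example:

kubectl describe node aks-nodepool1 | grep Taints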

With a taint applied to nodes, you then define a toleration in the pod specification that allows scheduling on those nodes. The following example defines sku: gpu and effect: NoSchedule to tolerate the taint applied to the node in the previous step:

kind: Pod
apiVersion: v1
metadata:
  name: tf-mnist
spec:
  containers:
  - name: tf-mnist
    image: dockerhub.azk8s.cn/microsoft/samples-tf-mnist-demo:gpu
    resources:
      requests:
        cpu: 0.5
        memory: 2Gi
      limits:
        cpu: 4.0
        memory: 16Gi
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"

When this pod is deployed, such as by using kubectl apply -f gpu-toleration.yaml, Kubernetes can successfully schedule the pod on the nodes with the taint applied. This logical isolation lets you control access to resources within a cluster.
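
As an illustrative check, assuming the manifest above is saved as gpu-toleration.yaml, you can deploy the pod and then confirm which node it was scheduled on:

kubectl apply -f gpu-toleration.yaml
kubectl get pod tf-mnist -o wide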

When you apply taints, work with your application developers and owners to allow them to define the required tolerations in their deployments.

For more information about how to use multiple node pools in AKS, see Create and manage multiple node pools for a cluster in AKS.
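
As a hedged sketch, a tainted node pool can also be created up front with the Azure CLI so that every node in the pool starts with the taint. The resource group, cluster, and pool names below are assumptions for illustration:

az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunodepool \
    --node-count 1 \
    --node-taints sku=gpu:NoSchedule \
    --no-wait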

Behavior of taints and tolerations in AKS

When you upgrade a node pool in AKS, taints and tolerations follow a set pattern as they're applied to new nodes:

  • Default clusters that use virtual machine scale sets

    • You can taint a node pool from the AKS API so that newly scaled-out nodes receive API-specified node taints.
    • Let's assume you have a two-node cluster - node1 and node2. You upgrade the node pool.
    • Two additional nodes are created, node3 and node4, and the taints are passed on respectively.
    • The original node1 and node2 are deleted.
  • Clusters without virtual machine scale set support

    • Again, let's assume you have a two-node cluster - node1 and node2. When you upgrade, an additional node (node3) is created.
    • The taints from node1 are applied to node3, then node1 is deleted.
    • Another new node is created (named node1, since the previous node1 was deleted), and the node2 taints are applied to the new node1. Then, node2 is deleted.
    • In essence, node1 becomes node3, and node2 becomes node1.

When you scale a node pool in AKS, taints and tolerations don't carry over by design.
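
Because scaled-out nodes don't automatically inherit taints you applied with kubectl, it can help to audit the taints currently present after a scale operation. This is a minimal sketch using a jsonpath query to list each node name alongside its taints:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'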

Control pod scheduling using node selectors and affinity

Best practice guidance - Control the scheduling of pods on nodes using node selectors, node affinity, or inter-pod affinity. These settings allow the Kubernetes scheduler to logically isolate workloads, such as by hardware in the node.

Taints and tolerations logically isolate resources with a hard cut-off - if a pod doesn't tolerate a node's taint, it isn't scheduled on that node. An alternate approach is to use node selectors. You label nodes, such as to indicate locally attached SSD storage or a large amount of memory, and then define a node selector in the pod specification. Kubernetes then schedules those pods on a matching node. Unlike tolerations, pods without a matching node selector can still be scheduled on labeled nodes. This behavior allows unused resources on the nodes to be consumed, but gives priority to pods that define the matching node selector.

Let's look at an example of nodes with a high amount of memory. These nodes can give preference to pods that request a high amount of memory. To make sure that the resources don't sit idle, they also allow other pods to run.

kubectl label node aks-nodepool1 hardware=highmem
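
To verify which nodes carry the label (the node name here is from the preceding example), you can filter nodes by the label selector:

kubectl get nodes -l hardware=highmem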

The pod specification then adds the nodeSelector property to define a node selector that matches the label set on a node:

kind: Pod
apiVersion: v1
metadata:
  name: tf-mnist
spec:
  containers:
  - name: tf-mnist
    image: dockerhub.azk8s.cn/microsoft/samples-tf-mnist-demo:gpu
    resources:
      requests:
        cpu: 0.5
        memory: 2Gi
      limits:
        cpu: 4.0
        memory: 16Gi
  nodeSelector:
    hardware: highmem

When you use these scheduler options, work with your application developers and owners to allow them to correctly define their pod specifications.

For more information about using node selectors, see Assigning Pods to Nodes.

Node affinity

A node selector is a basic way to assign pods to a given node. More flexibility is available using node affinity. With node affinity, you define what happens if a pod can't be matched with a node. You can require that the Kubernetes scheduler match a pod with a labeled host. Or, you can prefer a match but allow the pod to be scheduled on a different host if no match is available.

The following example sets the node affinity to requiredDuringSchedulingIgnoredDuringExecution. This affinity requires the Kubernetes scheduler to use a node with a matching label. If no node is available, the pod has to wait for scheduling to continue. To allow the pod to be scheduled on a different node, you can instead set the value to preferredDuringSchedulingIgnoredDuringExecution:

kind: Pod
apiVersion: v1
metadata:
  name: tf-mnist
spec:
  containers:
  - name: tf-mnist
    image: dockerhub.azk8s.cn/microsoft/samples-tf-mnist-demo:gpu
    resources:
      requests:
        cpu: 0.5
        memory: 2Gi
      limits:
        cpu: 4.0
        memory: 16Gi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: hardware
            operator: In
            values:
            - highmem

The IgnoredDuringExecution part of the setting indicates that if the node labels change, the pod shouldn't be evicted from the node. The Kubernetes scheduler only uses the updated node labels for new pods being scheduled, not for pods already scheduled on the nodes.
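
As a minimal sketch of the preferred variant mentioned earlier, the affinity stanza below asks the scheduler to favor highmem nodes but still allows the pod to be scheduled elsewhere if none match. The weight value is an illustrative choice:

  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: hardware
            operator: In
            values:
            - highmem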

For more information, see Affinity and anti-affinity.

Inter-pod affinity and anti-affinity

One final approach for the Kubernetes scheduler to logically isolate workloads is to use inter-pod affinity or anti-affinity. These settings define that pods either shouldn't or should be scheduled on a node that has an existing matching pod. By default, the Kubernetes scheduler tries to schedule multiple pods in a replica set across nodes. You can define more specific rules around this behavior.

A good example is a web application that also uses an Azure Cache for Redis. You can use pod anti-affinity rules to request that the Kubernetes scheduler distribute replicas across nodes. You can then use affinity rules to make sure each web app component is scheduled on the same host as a corresponding cache. The distribution of pods across nodes looks like the following example:

Node 1      Node 2      Node 3
webapp-1    webapp-2    webapp-3
cache-1     cache-2     cache-3

This example is a more complex deployment than the use of node selectors or node affinity. The deployment gives you control over how Kubernetes schedules pods on nodes and can logically isolate resources. For a complete example of this web application with Azure Cache for Redis, see Co-locate pods on the same node.
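
For illustration, a hedged sketch of the affinity stanzas that produce this kind of layout follows; the app: cache and app: webapp labels are assumptions, and the complete manifests are in the linked example. The cache pods declare anti-affinity against each other so they spread across nodes, while each web app pod declares affinity to a cache pod so it lands on the same node:

  # In the cache pod template: spread cache replicas across different nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache
        topologyKey: "kubernetes.io/hostname"

  # In the web app pod template: co-locate each web app pod with a cache pod
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache
        topologyKey: "kubernetes.io/hostname"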

Next steps

This article focused on advanced Kubernetes scheduler features. For more information about cluster operations in AKS, see the following best practices: