This article describes how to configure node pools for Azure Kubernetes Service (AKS) node auto provisioning (NAP), including SKU selectors, resource limits, and priority weights.

## Node pool overview

NodePools set constraints on the nodes that node auto provisioning creates and on the pods that run on those nodes. A default NodePool is created when you first install node auto provisioning. You can modify this NodePool or add more NodePools.

Key behaviors of NodePools:

- Node auto provisioning requires at least one NodePool to function.
- Node auto provisioning evaluates each configured NodePool.
- Node auto provisioning skips NodePools that have taints the pod doesn't tolerate.
- Node auto provisioning applies startup taints to provisioned nodes, but pods don't need to tolerate them.
- Node auto provisioning works best with mutually exclusive NodePools. When multiple NodePools match, the NodePool with the highest weight is used.

Node auto provisioning uses virtual machine (VM) stock-keeping unit (SKU) requirements to select the best VM for pending workloads. You can configure SKU families, VM types, spot or on-demand instances, architectures, and resource limits.
## Default node pool configuration

### Understand the default NodePools

AKS creates a default NodePool configuration that you can customize:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      # Requirements that constrain the parameters of provisioned nodes.
      # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
      # Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.
      # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
            - D
```
A `system-surge` NodePool is also created to automatically scale system pool nodes.
### Control default NodePool creation

The `--node-provisioning-default-pools` flag controls which default node auto provisioning NodePools are created:

- `Auto` (default): Creates the two standard NodePools for immediate use.
- `None`: No default NodePools are created; you must define your own NodePools.
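As a sketch of how the flag might be used, the following command switches an existing cluster to `None`. The resource group and cluster names are placeholders, and this assumes `az aks update` accepts the `--node-provisioning-default-pools` flag named above:

```bash
az aks update \
    --resource-group myResourceGroup \
    --name myAKSCluster \
    --node-provisioning-default-pools None
```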
Warning

Changing from `Auto` to `None` on an existing cluster doesn't automatically delete the default NodePools. You must delete them manually if you no longer want them.
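To remove the previously created pools by hand, you can delete them by name. This is a sketch that assumes the default pools are named `default` and `system-surge`, as described earlier in this article:

```bash
# Delete the two default NodePools created under the Auto setting
kubectl delete nodepool default system-surge
```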
## Comprehensive NodePool example

The following example demonstrates all of the available NodePool configuration options with detailed comments:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  # Template section that describes how to template out NodeClaim resources that Karpenter will provision
  # Karpenter will consider this template to be the minimum requirements needed to provision a Node using this NodePool
  # It will overlay this NodePool with Pods that need to schedule to further constrain the NodeClaims
  # Karpenter will provision to launch new Nodes for the cluster
  template:
    metadata:
      # Labels are arbitrary key-values that are applied to all nodes
      labels:
        billing-team: my-team
      # Annotations are arbitrary key-values that are applied to all nodes
      annotations:
        example.com/owner: "my-team"
    spec:
      nodeClassRef:
        apiVersion: karpenter.azure.com/v1beta1
        kind: AKSNodeClass
        name: default
      # Provisioned nodes will have these taints
      # Taints may prevent pods from scheduling if they are not tolerated by the pod.
      taints:
        - key: example.com/special-taint
          effect: NoSchedule
      # Provisioned nodes will have these taints, but pods do not need to tolerate these taints to be provisioned by this
      # NodePool. These taints are expected to be temporary and some other entity (e.g. a DaemonSet) is responsible for
      # removing the taint after it has finished initializing the node.
      startupTaints:
        - key: example.com/another-taint
          effect: NoSchedule
      # Requirements that constrain the parameters of provisioned nodes.
      # These requirements are combined with pod.spec.topologySpreadConstraints, pod.spec.affinity.nodeAffinity, pod.spec.affinity.podAffinity, and pod.spec.nodeSelector rules.
      # Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.
      # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
      requirements:
        - key: "karpenter.azure.com/sku-family"
          operator: In
          values: ["D", "E", "F"]
          # minValues here enforces the scheduler to consider at least that number of unique sku-family to schedule the pods.
          # This field is ALPHA and can be dropped or replaced at any time
          minValues: 2
        - key: "karpenter.azure.com/sku-name"
          operator: In
          values: ["Standard_D2s_v3","Standard_D4s_v3","Standard_E2s_v3","Standard_E4s_v3","Standard_F2s_v2","Standard_F4s_v2"]
          minValues: 5
        - key: "karpenter.azure.com/sku-cpu"
          operator: In
          values: ["2", "4", "8", "16"]
        - key: "karpenter.azure.com/sku-version"
          operator: Gt
          values: ["2"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["chinanorth3-1", "chinanorth3-2"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64", "amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
      # ExpireAfter is the duration the controller will wait
      # before terminating a node, measured from when the node is created. This
      # is useful to implement features like eventually consistent node upgrade,
      # memory leak protection, and disruption testing.
      expireAfter: 720h
      # TerminationGracePeriod is the maximum duration the controller will wait before forcefully deleting the pods on a node, measured from when deletion is first initiated.
      #
      # Warning: this feature takes precedence over a Pod's terminationGracePeriodSeconds value, and bypasses any blocked PDBs or the karpenter.sh/do-not-disrupt annotation.
      #
      # This field is intended to be used by cluster administrators to enforce that nodes can be cycled within a given time period.
      # When set, drifted nodes will begin draining even if there are pods blocking eviction. Draining will respect PDBs and the do-not-disrupt annotation until the TGP is reached.
      #
      # Karpenter will preemptively delete pods so their terminationGracePeriodSeconds align with the node's terminationGracePeriod.
      # If a pod would be terminated without being granted its full terminationGracePeriodSeconds prior to the node timeout,
      # that pod will be deleted at T = node timeout - pod terminationGracePeriodSeconds.
      #
      # The feature can also be used to allow maximum time limits for long-running jobs which can delay node termination with preStop hooks.
      # If left undefined, the controller will wait indefinitely for pods to be drained.
      terminationGracePeriod: 30s
  # Disruption section which describes the ways in which Karpenter can disrupt and replace Nodes
  # Configuration in this section constrains how aggressive Karpenter can be with performing operations
  # like rolling Nodes due to them hitting their maximum lifetime (expiry) or scaling down nodes to reduce cluster cost
  disruption:
    # Describes which types of Nodes Karpenter should consider for consolidation
    # If using 'WhenEmptyOrUnderutilized', Karpenter will consider all nodes for consolidation and attempt to remove or replace Nodes when it discovers that the Node is underutilized and could be changed to reduce cost
    # If using `WhenEmpty`, Karpenter will only consider nodes for consolidation that contain no workload pods
    consolidationPolicy: WhenEmptyOrUnderutilized | WhenEmpty
    # The amount of time Karpenter should wait after discovering a consolidation decision
    # This value can currently only be set when the consolidationPolicy is 'WhenEmpty'
    # You can choose to disable consolidation entirely by setting the string value 'Never' here
    consolidateAfter: 30s
    # Budgets control the speed Karpenter can scale down nodes.
    # Karpenter will respect the minimum of the currently active budgets, and will round up
    # when considering percentages. Duration and Schedule must be set together.
    budgets:
      - nodes: 10%
      # On Weekdays during business hours, don't do any deprovisioning.
      - schedule: "0 9 * * mon-fri"
        duration: 8h
        nodes: "0"
  # Resource limits constrain the total size of the pool.
  # Limits prevent Karpenter from creating new instances once the limit is exceeded.
  limits:
    cpu: "1000"
    memory: 1000Gi
  # Priority given to the NodePool when the scheduler considers which NodePool
  # to select. Higher weights indicate higher priority when comparing NodePools.
  # Specifying no weight is equivalent to specifying a weight of 0.
  weight: 10
```
## Supported node provisioner requirements

Kubernetes defines well-known labels, and Azure implements them. These labels are defined in the `spec.requirements` section of the NodePool API.

In addition to the Kubernetes well-known labels, node auto provisioning supports Azure-specific labels for more advanced scheduling.
### SKU selectors with well-known labels

| Selector | Description | Example |
|---|---|---|
| `karpenter.azure.com/sku-family` | VM SKU family | D, F, L, etc. |
| `karpenter.azure.com/sku-name` | Explicit SKU name | Standard_A1_v2 |
| `karpenter.azure.com/sku-version` | SKU version (without "v"; can use 1) | 1, 2 |
| `karpenter.sh/capacity-type` | VM allocation type (spot / on-demand) | spot or on-demand |
| `karpenter.azure.com/sku-cpu` | Number of CPUs in the VM | 16 |
| `karpenter.azure.com/sku-memory` | Memory in the VM (MiB) | 131072 |
| `karpenter.azure.com/sku-gpu-name` | GPU name | A100 |
| `karpenter.azure.com/sku-gpu-manufacturer` | GPU manufacturer | nvidia |
| `karpenter.azure.com/sku-gpu-count` | GPU count per VM | 2 |
| `karpenter.azure.com/sku-networking-accelerated` | Whether the VM has accelerated networking | [true, false] |
| `karpenter.azure.com/sku-storage-premium-capable` | Whether the VM supports premium IO storage | [true, false] |
| `karpenter.azure.com/sku-storage-ephemeralos-maxsize` | Size limit for the ephemeral OS disk (GB) | 92 |
| `topology.kubernetes.io/zone` | Availability zone | [chinanorth3-1, chinanorth3-2, chinanorth3-3] |
| `kubernetes.io/os` | Operating system (Linux only during preview) | linux |
| `kubernetes.io/arch` | CPU architecture (AMD64 or ARM64) | [amd64, arm64] |
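These selectors can be combined in a single `requirements` list. The following sketch restricts provisioning to VMs with more than four vCPUs that support premium storage; the threshold values are illustrative, not recommendations:

```yaml
requirements:
  - key: karpenter.azure.com/sku-cpu
    operator: Gt
    values: ["4"]
  - key: karpenter.azure.com/sku-storage-premium-capable
    operator: In
    values: ["true"]
```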
### Well-known labels

#### Instance types

- key: `node.kubernetes.io/instance-type`
- key: `karpenter.azure.com/sku-family`
- key: `karpenter.azure.com/sku-name`
- key: `karpenter.azure.com/sku-version`

In general, instance types should be a list rather than a single value. Leaving these requirements undefined is recommended, because doing so maximizes the choices for efficiently placing pods.

Most Azure VM sizes are supported, excluding specialized sizes that don't support AKS.
#### SKU family examples

The `karpenter.azure.com/sku-family` selector lets you target specific VM families:

- D-series: General-purpose VMs with a balanced CPU-to-memory ratio
- F-series: Compute-optimized VMs with a higher CPU-to-memory ratio
- E-series: Memory-optimized VMs for memory-intensive applications
- L-series: Storage-optimized VMs with high disk throughput
- N-series: GPU-enabled VMs for compute-intensive workloads
Example configuration:

```yaml
requirements:
  - key: karpenter.azure.com/sku-family
    operator: In
    values:
      - D
      - F
```
#### SKU name examples

The `karpenter.azure.com/sku-name` selector lets you specify exact VM instance types:

```yaml
requirements:
  - key: karpenter.azure.com/sku-name
    operator: In
    values:
      - Standard_D4s_v3
      - Standard_F8s_v2
```
#### SKU version examples

The `karpenter.azure.com/sku-version` selector targets specific generations:

```yaml
requirements:
  - key: karpenter.azure.com/sku-version
    operator: In
    values:
      - "3" # v3 generation
      - "5" # v5 generation
```
#### Availability zones

- key: `topology.kubernetes.io/zone`
- value example: `chinanorth3-1`
- value list: run `az account list-locations --output table` to view the available zones

You can configure node auto provisioning to create nodes in a specific zone. Note that availability zone `chinanorth3-1` for your Azure subscription might not correspond to the same physical location as `chinanorth3-1` for another Azure subscription.
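A minimal zone constraint looks like the following; the zones shown are examples from the table above:

```yaml
requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values:
      - chinanorth3-1
      - chinanorth3-2
```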
#### Architecture

- key: `kubernetes.io/arch`
- values:
  - `amd64`
  - `arm64`

Node auto provisioning supports both `amd64` and `arm64` nodes.
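For example, a sketch that restricts a NodePool to Arm-based VMs only:

```yaml
requirements:
  - key: kubernetes.io/arch
    operator: In
    values:
      - arm64
```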
#### Operating system

- key: `kubernetes.io/os`
- values:
  - `linux`

Node auto provisioning supports the `linux` operating system (Ubuntu + AzureLinux). Windows support is coming soon.
#### Capacity type

- key: `karpenter.sh/capacity-type`
- values:
  - `spot`
  - `on-demand`

If a NodePool allows both spot and on-demand instances, node auto provisioning prioritizes spot offerings.
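The following sketch prefers spot capacity while allowing fallback to on-demand when spot capacity is unavailable:

```yaml
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
      - spot
      - on-demand # used as fallback when spot capacity is unavailable
```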
## Node pool limits

By default, node auto provisioning attempts to schedule workloads within the available Azure quota. You can also specify an upper bound on the resources a node pool uses by setting limits in the node pool spec:

```yaml
spec:
  # Resource limits constrain the total size of the cluster.
  # Limits prevent Node Auto Provisioning from creating new instances once the limit is exceeded.
  limits:
    cpu: "1000"
    memory: 1000Gi
```
## Node pool weights

When you define multiple node pools, you can set a preference for where workloads should be scheduled by assigning relative weights to the node pool definitions:

```yaml
spec:
  # Priority given to the node pool when the scheduler considers which to select.
  # Higher weights indicate higher priority when comparing node pools.
  # Specifying no weight is equivalent to specifying a weight of 0.
  weight: 10
```
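As a concrete sketch, a pool with a higher weight is tried before any lower-weight pool whenever both match a pending pod. The pool name `preferred-pool` below is illustrative:

```yaml
# A hypothetical higher-priority pool; when this pool and a lower-weight
# pool both match a pending pod, node auto provisioning uses this one.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: preferred-pool
spec:
  weight: 50
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
            - D
```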
## Example node pool configurations

### GPU-enabled node pool

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  weight: 20
  limits:
    cpu: "500"
    memory: 500Gi
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.azure.com/sku-gpu-manufacturer
          operator: In
          values:
            - nvidia
        - key: karpenter.azure.com/sku-gpu-count
          operator: Gt
          values:
            - "0"
```
### Spot instance node pool

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  weight: 5
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand # Spot node pools with on-demand configured will fall back to on-demand capacity
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
            - D
            - F
```
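To use one of these examples, save the manifest to a file, apply it, and verify that the pool was created. This is a sketch; the filename is a placeholder:

```bash
# Apply the NodePool manifest and list the pools the cluster knows about
kubectl apply -f spot-pool.yaml
kubectl get nodepools
```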